Device Agnostic Pipeline #140
base: sycl-develop
Conversation
What's the purpose of this PR? Is it to have an example a user can try out on any HW and get decent performance? We don't expect anyone to really use this for anything, right?
The GEMM isn't really device agnostic, is it? It's more that any GPU a user is likely to run on has the features assumed by the SM_70 pipeline.
If so, would it be sufficient to only add the example, but not the new builder, and instead directly use the SM_70 builder as a baseline for most current GPUs?
```cpp
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {

  out << "PVC GEMM Example\n\n"
```
still "PVC"
Fixed
```cpp
/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {

  out << "PVC GEMM Example\n\n"
```
here too
Fixed
```cpp
// Run examples
//

// The KernelHardwareInfo struct holds the number of EUs on the GPU with a given device ID. This
```
"EU" isn't a general term
Fixed
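For reference, a minimal sketch of how such a hardware-info struct is typically populated in CUTLASS examples. Field and function names follow upstream CUTLASS; the SYCL port may name things differently:

```cpp
#include "cutlass/kernel_hardware_info.h"

// Query the compute-unit count for device 0. Assumption: upstream CUTLASS
// naming, where "sm_count" is the generic compute-unit count despite the
// CUDA-flavored name.
cutlass::KernelHardwareInfo hw_info;
hw_info.device_id = 0;
hw_info.sm_count =
    cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
```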
```cpp
#endif

using TiledMMA = TiledMMA<MMA_Atom<UniversalFMA<ElementAccumulator, ElementA, ElementB, ElementAccumulator>>,
                          Layout<Shape<_4, _4, _1>>>;
```
why this shape?
This would result in a work-group size of 16; it's small enough that it would run on any device, hence the size. No other reason in particular.
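A self-contained sketch of that reasoning, with element types swapped for float purely for illustration: a 4x4x1 layout of single-thread FMA atoms yields a 16-thread tiled MMA.

```cpp
#include <cute/atom/mma_atom.hpp>

using namespace cute;

// One scalar FMA per thread, arranged 4 x 4 x 1 -> 16 threads total.
// The float element types are an illustrative assumption.
using TiledMmaSketch = TiledMMA<
    MMA_Atom<UniversalFMA<float, float, float, float>>,  // D, A, B, C types
    Layout<Shape<_4, _4, _1>>>;                           // thread layout (M, N, K)

// size() of a TiledMMA is its thread count; a work-group of 16 fits any device.
static_assert(size(TiledMmaSketch{}) == 16, "4 * 4 * 1 = 16 threads");
```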
The SM70 mainloop is device agnostic: it implements a tiled GEMM algorithm with data blocked in shared memory and registers. With us passing UniversalCopy and UniversalFMA, this would become a truly device-agnostic GEMM.
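As a hedged illustration of what "passing UniversalCopy" means here, a tiled copy built from plain element-wise copy atoms; the 16-thread layout and float element type are assumptions carried over from the sketch above, and the PR's actual wiring may differ:

```cpp
#include <cute/atom/copy_atom.hpp>

using namespace cute;

// UniversalCopy lowers to ordinary loads/stores, so nothing in this tiled
// copy is architecture specific.
using GmemTiledCopyA = decltype(make_tiled_copy(
    Copy_Atom<UniversalCopy<float>, float>{},  // one element per instruction
    Layout<Shape<_16, _1>>{},                  // 16 threads (assumed)
    Layout<Shape<_1, _1>>{}));                 // 1 value per thread per copy
```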
SM70 does not have a collective builder. Also, I believe the idea is that the API accepts something like a …
Adds CollectiveMma and a CollectiveBuilder API for a device-agnostic pipeline. This piggybacks off of the SM_70 two-stage GEMM pipeline, with blocking in SMem and RMem, to get a somewhat performant GEMM on any device.
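For a sense of scale, this is the general shape of a CUTLASS 3.x CollectiveBuilder invocation that such an API would plug into. Every template argument below (arch tag, op class, element types, tile sizes) is a placeholder assumption; the tags this PR actually introduces are not visible in this excerpt:

```cpp
#include "cutlass/gemm/collective/collective_builder.hpp"

// All parameters here are illustrative placeholders, not the PR's real tags.
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm70, cutlass::arch::OpClassSimt,  // placeholder arch / op class
    float, cutlass::layout::RowMajor, 1,              // A: element, layout, alignment
    float, cutlass::layout::ColumnMajor, 1,           // B: element, layout, alignment
    float,                                            // accumulator element
    cute::Shape<cute::_32, cute::_32, cute::_8>,      // work-group tile (M, N, K), assumed
    cute::Shape<cute::_1, cute::_1, cute::_1>,        // cluster shape (unused pre-SM90)
    cutlass::gemm::collective::StageCount<2>,         // two-stage pipeline, per the PR
    cutlass::gemm::collective::KernelScheduleAuto     // default schedule
  >::CollectiveOp;
```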