Device Agnostic Pipeline #140

Open
wants to merge 8 commits into
base: sycl-develop

AD2605
Collaborator

@AD2605 AD2605 commented Oct 2, 2024

Adds a CollectiveMMA and a CollectiveBuilder API for a device-agnostic pipeline.
This piggybacks on the SM_70 2-stage GEMM pipeline, with blocking in shared memory (SMem) and registers (RMem), to get a somewhat performant GEMM on any device.

@AD2605 AD2605 marked this pull request as ready for review October 9, 2024 14:30
@AD2605 AD2605 changed the title from "[DRAFT] Device Agnostic Pipeline" to "Device Agnostic Pipeline" Oct 10, 2024
Collaborator

@rolandschulz rolandschulz left a comment

What's the purpose of this PR?
Is it to have an example a user can try out on any HW and get decent performance?
We don't expect anyone to really use this for anything, right?
The GEMM isn't really device agnostic, is it? It's more that any GPU a user is likely to run on has the features assumed by the SM_70 pipeline.
If so, would it be sufficient to add only the example, without the new builder, and instead use the SM_70 builder directly as a baseline for most current GPUs?

/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {

out << "PVC GEMM Example\n\n"
Collaborator

still "PVC"

Collaborator Author

Fixed

/// Prints the usage statement.
std::ostream & print_usage(std::ostream &out) const {

out << "PVC GEMM Example\n\n"
Collaborator

here too

Collaborator Author

Fixed

// Run examples
//

// The KernelHardwareInfo struct holds the number of EUs on the GPU with a given device ID. This
Collaborator

"EU" isn't a general term

Collaborator Author

Fixed

#endif

using TiledMMA = TiledMMA<MMA_Atom<UniversalFMA<ElementAccumulator, ElementA, ElementB, ElementAccumulator>>,
Layout<Shape<_4, _4, _1>>>;
Collaborator

why this shape?

Collaborator Author

This results in a work-group size of 16 (4 x 4 x 1), which is small enough to run on any device. There is no other particular reason for this shape.
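The arithmetic behind that work-group size can be checked with a small standalone sketch. It does not use CuTe; `work_items` and `fits_device` are hypothetical helpers introduced only for illustration, and the sketch assumes the FMA atom contributes a single work-item, so the tiling extents alone determine the total.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CuTe API): multiplies the extents of a thread
// tiling to get the number of work-items it implies.
template <std::size_t N>
constexpr std::size_t work_items(const std::array<std::size_t, N>& extents) {
  std::size_t n = 1;
  for (std::size_t i = 0; i < N; ++i) n *= extents[i];
  return n;
}

// Hypothetical check: does a work-group of this size fit within a device's
// maximum work-group size?
inline bool fits_device(std::size_t work_group_size, std::size_t device_max) {
  assert(device_max > 0 && "a device must support at least one work-item");
  return work_group_size <= device_max;
}

// Extents taken from Layout<Shape<_4, _4, _1>> in the reviewed code.
constexpr std::array<std::size_t, 3> kAtomLayout{4, 4, 1};
static_assert(work_items(kAtomLayout) == 16,
              "a 4 x 4 x 1 tiling of single-work-item atoms gives 16 work-items");
```

A larger tiling would raise occupancy on capable hardware, at the cost of the "runs anywhere" property the small shape is chosen for.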

Collaborator Author

AD2605 commented Oct 17, 2024

> The GEMM isn't really device agnostic, is it? It's more that any GPU a user is likely to run on has the features assumed by the SM_70 pipeline.

The SM70 mainloop is device agnostic: it implements a tiled GEMM algorithm, with data blocked in shared memory and registers. Since we pass it UniversalCopy and UniversalFMA, this becomes a truly device-agnostic GEMM.
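As a rough illustration of that algorithm, here is a plain C++ sketch of a tiled GEMM with two levels of blocking. It is not the CUTLASS mainloop: it runs on the host, the tile sizes are arbitrary assumptions, and it omits the two-stage overlap of loads and compute. The `tileA`/`tileB` buffers stand in for shared memory and `acc` stands in for register accumulators.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative tile sizes (assumptions, not values from the PR).
constexpr std::size_t kBlockM = 8, kBlockN = 8, kBlockK = 4;

// Computes C += A * B for row-major A (M x K), B (K x N), C (M x N).
inline void tiled_gemm(std::size_t M, std::size_t N, std::size_t K,
                       const std::vector<float>& A,
                       const std::vector<float>& B,
                       std::vector<float>& C) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  float tileA[kBlockM][kBlockK];  // stands in for the shared-memory tile of A
  float tileB[kBlockK][kBlockN];  // stands in for the shared-memory tile of B
  for (std::size_t m0 = 0; m0 < M; m0 += kBlockM) {
    for (std::size_t n0 = 0; n0 < N; n0 += kBlockN) {
      float acc[kBlockM][kBlockN] = {};  // stands in for register accumulators
      for (std::size_t k0 = 0; k0 < K; k0 += kBlockK) {
        // Stage one K-slice of A and B, zero-padding past the matrix edges.
        for (std::size_t i = 0; i < kBlockM; ++i)
          for (std::size_t k = 0; k < kBlockK; ++k)
            tileA[i][k] = (m0 + i < M && k0 + k < K)
                              ? A[(m0 + i) * K + (k0 + k)] : 0.0f;
        for (std::size_t k = 0; k < kBlockK; ++k)
          for (std::size_t j = 0; j < kBlockN; ++j)
            tileB[k][j] = (k0 + k < K && n0 + j < N)
                              ? B[(k0 + k) * N + (n0 + j)] : 0.0f;
        // Accumulate with scalar multiply-adds, the generic fallback operation.
        for (std::size_t i = 0; i < kBlockM; ++i)
          for (std::size_t j = 0; j < kBlockN; ++j)
            for (std::size_t k = 0; k < kBlockK; ++k)
              acc[i][j] += tileA[i][k] * tileB[k][j];
      }
      // Epilogue: write the accumulated tile back, skipping padded entries.
      for (std::size_t i = 0; i < kBlockM && m0 + i < M; ++i)
        for (std::size_t j = 0; j < kBlockN && n0 + j < N; ++j)
          C[(m0 + i) * N + (n0 + j)] += acc[i][j];
    }
  }
}
```

Nothing here depends on a hardware matrix instruction or a specialized copy, which is the sense in which the structure carries over to any device.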

> If so, would it be sufficient to only add the example, but not the new builder, and instead directly use the SM_70 builder as a baseline for most current GPUs?

SM70 does not have a collective builder. Also, I believe the idea is that the API accepts something like a DeviceAgnostic arch, instead of relying on the user to understand that the sm_70 pipeline can be turned into a device-agnostic one.
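The idea of selecting a pipeline through an arch tag can be sketched with template specialization. This is a simplified, hypothetical illustration and not the CUTLASS CollectiveBuilder signature: every name in the `sketch` namespace is invented for this example, and the real builder takes many more parameters.

```cpp
#include <type_traits>

namespace sketch {

// Architecture tags (hypothetical). Sm70 is listed only to show a tag that has
// no builder specialization.
struct Sm70 {};
struct DeviceAgnostic {};

// Placeholder types standing in for the real mainloop and operations.
struct TwoStageMainloop {};
struct GenericCopy {};
struct GenericFma {};

// The primary template is left undefined, so an unsupported tag fails at
// compile time instead of silently picking a pipeline.
template <class ArchTag>
struct CollectiveBuilder;

// The DeviceAgnostic tag maps to the two-stage mainloop with generic operations,
// so the user never needs to know which pipeline is reused underneath.
template <>
struct CollectiveBuilder<DeviceAgnostic> {
  using Mainloop = TwoStageMainloop;
  using CopyOp = GenericCopy;
  using MmaOp = GenericFma;
};

// Detects whether a builder specialization exists for a given tag.
template <class ArchTag, class = void>
struct has_builder : std::false_type {};

template <class ArchTag>
struct has_builder<ArchTag, decltype(void(sizeof(CollectiveBuilder<ArchTag>)))>
    : std::true_type {};

}  // namespace sketch

static_assert(sketch::has_builder<sketch::DeviceAgnostic>::value,
              "the DeviceAgnostic tag resolves to a builder");
static_assert(!sketch::has_builder<sketch::Sm70>::value,
              "no builder is specialized for the Sm70 tag in this sketch");
```

With this shape, a user asks for `CollectiveBuilder<DeviceAgnostic>::Mainloop` and gets the reused pipeline without having to name it.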
