Updates to support new sdpa function #458
base: main
Conversation
For my understanding, is it possible for SDPA to detect that the input key/value shapes are targeting GQA, so that we do not need to pass `enable_gqa=True`? In this case, the key/value shapes are not broadcastable, so there should not be any uncertainty about the semantics.
@awgu We had much discussion on this one with different folks around the org. TL;DR: yes, it is completely possible to recognize when users want to do this and enable it if the shapes work. However, this doesn't fit naturally into the existing broadcasting semantics, so in theory users could "make a mistake", pass in misshapen inputs, and not get an error. The consensus was to add this extra check so that users give a strong signal of their intention.
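A minimal sketch of the trade-off described above (my own illustration, not the code in this PR): without the explicit flag, a head-count mismatch would either have to be silently interpreted as GQA or rejected, so the flag acts as the user's signal. The helper name `check_gqa_shapes` and the error messages are hypothetical.

```
import torch

def check_gqa_shapes(query, key, value, enable_gqa):
    # Hypothetical shape check illustrating the discussion above; not the
    # actual implementation inside scaled_dot_product_attention.
    q_heads, kv_heads = query.size(-3), key.size(-3)
    if q_heads == kv_heads:
        return  # ordinary multi-head attention, nothing special to do
    if not enable_gqa:
        # Without the explicit flag, mismatched head counts are treated as a
        # likely user error instead of being silently broadcast as GQA.
        raise ValueError(f"query has {q_heads} heads but key/value have {kv_heads}")
    if key.size(-3) != value.size(-3) or q_heads % kv_heads != 0:
        raise ValueError("enable_gqa=True requires q_heads % kv_heads == 0")
```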
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument `enable_gqa: bool` to the sdpa function call
- It adds a meaning to the third-to-last dimension.

Sample use case this would enable: Llama 3

```
# Llama 3 8B call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)
output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)
# Output Shape: (batch, 32, seq_len_q, D)
```

### Design Choice:

- Check that Query.size(-3) == Key.size(-3) == Value.size(-3), or that Query.size(-3) % Key.size(-3) == 0.
- The function adjusts the key and value tensors to match the query tensor's head dimension using repeat_interleave when their numbers of heads are not equal, enabling correct and efficient computation in attention mechanisms (see the sketch after this description).
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**

Across batch sizes, enable_gqa=True shows a substantial improvement in the runtime of sdpa:

| batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time (enable_gqa=True) | forward_time (enable_gqa=False) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 32 | 8 | 2048 | 2048 | 2048 | 100.71 | 119.70 |
| 8 | 32 | 8 | 2048 | 2048 | 2048 | 539.78 | 628.83 |
| 16 | 32 | 8 | 2048 | 2048 | 2048 | 1056.81 | 1225.48 |
| 32 | 32 | 8 | 2048 | 2048 | 2048 | 2099.54 | 2440.45 |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan:** pytorch/torchtitan#458

Pull Request resolved: #128898
Approved by: https://github.com/drisspg
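For clarity, here is a rough reference for the behavior described in the Design Choice bullets above, under the stated constraint Q_Heads % KV_Heads == 0. This is my own sketch built from repeat_interleave plus the existing scaled_dot_product_attention, not the PR's actual implementation; `sdpa_gqa_reference` is a hypothetical name.

```
import torch
import torch.nn.functional as F

def sdpa_gqa_reference(query, key, value, is_causal=True):
    # Sketch of the semantics described above: expand key/value along the head
    # dimension with repeat_interleave so they match the query's head count,
    # then call ordinary scaled_dot_product_attention.
    # Inputs are assumed to be (batch, num_heads, seq_len, head_dim).
    n_rep = query.size(1) // key.size(1)  # e.g. 32 // 8 = 4 for Llama 3 8B
    key = key.repeat_interleave(n_rep, dim=1)
    value = value.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(query, key, value, is_causal=is_causal)

# Mirrors the Llama 3 example above: 32 query heads sharing 8 key/value heads.
batch, seq_len_q, seq_len_kv, D = 2, 128, 128, 64
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)
out = sdpa_gqa_reference(query, key, value)  # shape: (batch, 32, seq_len_q, D)
```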
Differential Revision: D60772086
Pull Request resolved: #132689
Approved by: https://github.com/drisspg