fix(train.py): mfu estimation to respect CPU-GPU sync point #527

Open
wants to merge 1 commit into
base: master
from

@JasonLiJT JasonLiJT commented Jun 23, 2024

Previously, the MFU timing measurement was taken before the CPU-GPU sync point in every iteration. As a result, `running_mfu`:

  • converged correctly when `log_interval = 1`.
  • could converge to more than 100% when `log_interval > 1`.
    • This can create the illusion that increasing `log_interval` speeds up training (it usually does not).

See diagrams below.

log_interval = 1

[Diagram: mfu drawio]

log_interval = 2

Note that t3 - t2 is discarded. Only t2 - t1, t4 - t3, and so on contribute to `running_mfu`.
[Diagram: mfu2 drawio]
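
To make the effect described above concrete, here is a minimal pure-Python sketch of a toy timing model. It is not the code in `train.py` and not necessarily the fix in this PR. The function `simulate`, its parameters (`launch`, `gpu`, `after_sync`), and the "utilisation = GPU time / measured dt" stand-in for MFU are all assumptions made for illustration. The model assumes the CPU enqueues an iteration's kernels in `launch` seconds, the GPU needs `gpu` seconds per iteration, and only a logging iteration blocks the CPU until the GPU is idle (as `loss.item()` would).

```python
def simulate(log_interval, n_iters=20, launch=0.01, gpu=1.0, after_sync=False):
    """Return the utilisation estimate (gpu / dt) recorded at each logging iteration.

    Toy model (all names hypothetical): times are simulated, no real GPU is used.
    """
    cpu = 0.0        # simulated CPU wall clock
    gpu_done = 0.0   # time at which the GPU finishes all queued work
    t0 = 0.0         # previous timestamp (old scheme) or previous sync (alternative)
    estimates = []
    for it in range(1, n_iters + 1):
        cpu += launch                        # CPU enqueues this iteration's kernels
        gpu_done = max(gpu_done, cpu) + gpu  # GPU runs them once it is free
        is_log = it % log_interval == 0
        if not after_sync:
            dt = cpu - t0                    # timestamp taken before any sync
            t0 = cpu
        if is_log:
            cpu = max(cpu, gpu_done)         # stand-in for loss.item(): wait for the GPU
            if after_sync:
                # One possible alternative: time sync-to-sync and average
                # over the iterations that ran in between.
                dt = (cpu - t0) / log_interval
                t0 = cpu
            estimates.append(gpu / dt)
    return estimates


if __name__ == "__main__":
    print("before sync, log_interval=1:", round(simulate(1)[-1], 3))
    print("before sync, log_interval=2:", round(simulate(2)[-1], 3))
    print("sync-to-sync, log_interval=2:", round(simulate(2, after_sync=True)[-1], 3))
```

In this model, with `log_interval = 2` the interval that contains the GPU wait ends on a non-logging iteration and is thrown away, so the logged `dt` only covers the CPU launch time and the estimate rises far above 1. Timing from sync to sync keeps the estimate at or below 1.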
