Conversation

inkcherry commented Jun 5, 2024

Based on #392: we got NaN loss during long-context training with DeepSpeed sequence parallelism (Ulysses) on a Llama-style model.

We found that this is a precision problem: computing the RoPE position representation in half precision loses accuracy at large position indices, which leads to NaN loss in long-context training. A similar fix was applied in Hugging Face transformers:
https://github.com/huggingface/transformers/blob/63fb253df0d976b95d9b4b9a7b0012e5f8a37896/src/transformers/models/llama/modeling_llama.py#L111

shrutiramesh1988 commented

Even with this fix, I'm still seeing loss=NaN when running Llama 2 pre-training on single or multiple nodes with BF16, ZeRO stage 1, --use-rotary-position-embeddings, and a sequence length of 4096. Could you kindly help?
