What does a learning rate of 2, 5, or 10 mean (as opposed to 2e-5 or 1e-4) in fine-tuning?

Hello!

I am EXTREMELY new to LLMs, so this might be a very stupid question, and if that’s the case, I apologize in advance.

I have been reading some articles and papers that mention learning rates like 2e-5 or 1e-4.

Why does the learning rate hyperparameter for fine-tuning GPT-3.5 look so different, with values like 2, 5, or 10? Am I missing something?

Thank you so much in advance!


Hi @shuklaeshita0209 and welcome to the community!

The most likely reason is that it's a scaling parameter (a multiplier applied to some base learning rate chosen internally), rather than the raw LR that is passed to the optimizer during training.
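To make that concrete, here is a minimal sketch of what a multiplier could mean. The base value below is purely an assumption for illustration; the actual base learning rate the service uses internally isn't something I know.

```python
# Hypothetical illustration: a learning-rate "multiplier" is scaled against a
# base LR picked internally by the training service. The base value here is an
# assumption, not the real one.
base_lr = 1e-5                    # assumed internal base learning rate
learning_rate_multiplier = 2      # the "2" you see in the fine-tuning settings

effective_lr = base_lr * learning_rate_multiplier
print(effective_lr)               # 2e-05 -- back in the familiar 2e-5 range
```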


It’s not a stupid question at all.

LLMs have a huge number of parameters, and as @platypus mentioned above, the GPT-3.5 value is a multiplier rather than the LR itself. The reasons the direct learning rates are so small are:

LLMs are huge: with billions of parameters, even tiny changes can have a big effect, so a small learning rate keeps training stable.

Preserving pre-trained knowledge: fine-tuning tweaks the model for a new task without overwriting what it already knows, and small steps make that possible.
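For contrast, when you run the training loop yourself (for example with PyTorch), the tiny value from the papers is exactly the number you hand to the optimizer, with no multiplier involved. A minimal sketch, using a toy linear layer as a stand-in for a real model:

```python
import torch
from torch import nn

# Toy stand-in for a fine-tuned model; a real LLM has billions of parameters.
model = nn.Linear(768, 768)

# In a hand-rolled setup, the small absolute learning rate (2e-5, 1e-4, ...)
# is passed straight to the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
print(optimizer.param_groups[0]["lr"])  # 2e-05
```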


Hello!

Okay, that makes sense. Thank you for taking the time to answer!


Okay, I will try a few different LRs and figure out what works best for my dataset (I will try smaller values). Thank you so much!
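For reference, something like this toy sweep is what I have in mind. It uses purely synthetic data and a tiny linear model, just to compare how a few candidate learning rates behave before touching the real setup:

```python
import torch
from torch import nn

# Toy learning-rate sweep on synthetic data; the candidate values are only examples.
torch.manual_seed(0)
x = torch.randn(256, 16)
y = x @ torch.randn(16, 1)

for lr in (1e-5, 1e-4, 1e-3):
    model = nn.Linear(16, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(200):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"lr={lr:.0e}  final training loss={loss.item():.4f}")
```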