Pretraining a Model From Scratch. Help a dude

Hi Developers,

So, after being fascinated by ChatGPT, I decided to dig deep into how it works, which eventually led me down the path of developing a transformer model from scratch. It took me months, but it has been worth it, and the proof of concept seems functional (on a smaller dataset, all the stages involved work: data preprocessing, training, and inference).

I am nearing the point where I would like to pretrain. I have prepared 1 billion tokens as my dataset (with 300M- and 120M-token sets as backups in case I can’t afford the 1B run). I want this to be a smaller model, and my budget covers renting an RTX A5000 GPU for a week. My model is autoregressive, and I want it solely for text generation.

My worry is that I don’t feel ready to kick off the pretraining itself. I fear that my chosen learning rate (1e-5) might not be optimal for the dataset. I don’t want to pretrain for the whole week only to get inadequate performance, which would force me to pretrain again. I would love to make this my breakthrough, and I feel like I have done enough research to reach the pretraining stage.

So my request is for advice only:

  • How do I know that I am ready to pretrain?
  • How do I know that this is the right learning rate?
  • What about the learning rate scheduler? How should I configure it?

From my research, most huge models end up training for fewer than 10 epochs, so on one hand I feel like I don’t really need a scheduler.
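For context, the scheduler I keep running into in my research is linear warmup followed by cosine decay. Here is a rough sketch of my understanding; the peak LR, warmup steps, and minimum LR are placeholder values, not settings I have validated for any particular model size:

```python
import math

def lr_at_step(step, max_steps, peak_lr=3e-4, warmup_steps=2000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr.

    All default values are illustrative placeholders, not
    recommendations for any particular model or dataset size.
    """
    if step < warmup_steps:
        # Ramp linearly from ~0 up to peak_lr over the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```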

Please don’t blast me about pretraining from scratch; I believe I have good reasons for doing this the hard way. One of them is that this is a great adventure and a superb learning process.

Thank you for your help.


I always applaud a good learning project! But the price for renting an RTX A5000 GPU is about $0.26/hr, or about $43 for a week.

You generally need to experiment a bit with things like the learning rate and the number of epochs, so my best advice is to save up a bit more money, buy a used consumer GPU from the last generation, and start by creating a smaller model that you can train on your own hardware :laughing:


Hmm, that’s a good suggestion.
Like I mentioned in my post, I have been experimenting with tiny datasets up to the point where I outgrew most freely available GPUs, including my tiny RTX 2060.
I believe I have done a good amount of experimentation.
Also, the A5000 is actually less than $35, and that’s surely affordable for me.

Your graphics card is still bigger than mine, so no worries there :wink:

Have you done any benchmarks on the model you’ve created? If not that’s definitely something I highly recommend that you do.

There’s no use in creating a larger model if the performance isn’t better, so you definitely need to have tests and benchmarks ready in order to compare them.
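A simple first benchmark for a pure text generator is perplexity on a held-out set: average the per-token negative log-likelihood your model assigns and exponentiate. A minimal sketch, assuming you can already get per-token NLLs out of your own eval loop:

```python
import math

def perplexity(nll_per_token):
    """Perplexity from a list of per-token negative log-likelihoods (in nats).

    nll_per_token is assumed to come from your own eval loop; how you
    compute it depends on your model code.
    """
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Sanity check: a model that assigns probability 1/2 to every token
# has a perplexity of exactly 2.
print(perplexity([math.log(2)] * 100))
```

Lower is better, and tracking it across checkpoints gives you a cheap way to compare runs and spot regressions.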


Yeah, I did not think about benchmarks. I guess I must do that now.

1 Like

Question: for the benchmarks, do you compare the base model or the finetuned version?

1 Like

Sounds like a good idea!

You can also compare against models created by others, or against different checkpoints that you save during training. The only thing you need to remember is to make sure that the data in your benchmarks isn’t also present in your training data :laughing:
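One cheap way to check for that kind of leakage is an n-gram overlap scan between your benchmark text and your training tokens. A rough sketch; the choice of n=8 here is arbitrary, not a standard:

```python
def ngrams(tokens, n=8):
    """All contiguous n-grams of a token sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_tokens, eval_tokens, n=8):
    """Fraction of eval n-grams that also appear in the training tokens.

    A rate well above zero suggests the eval data leaked into training.
    """
    eval_grams = ngrams(eval_tokens, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_tokens, n)) / len(eval_grams)
```

For a real 1B-token corpus you would want to hash the n-grams and stream them rather than hold sets in memory, but the idea is the same.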

1 Like

Damn, no rest for the wicked :laughing:
I think I should also modify my code to save a checkpoint every epoch, thanks for pointing that out :smile:
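If I save a checkpoint every epoch, disk could fill up fast on a week-long run, so I might pair it with a small rotation helper like the sketch below. It assumes checkpoint filenames sort in training order (e.g. epoch001.pt, epoch002.pt, …), which is my own naming choice, not anything standard:

```python
import os

def rotate_checkpoints(ckpt_dir, keep_last=3, suffix=".pt"):
    """Delete all but the newest keep_last checkpoint files in ckpt_dir.

    Assumes filenames sort in training order (e.g. zero-padded epoch
    numbers); adjust the sort key if your naming scheme differs.
    """
    ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(suffix))
    for name in ckpts[:-keep_last]:
        os.remove(os.path.join(ckpt_dir, name))
```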


Always happy to help!

I hope you have fun and get some good results :heart:

1 Like

This seems like a cool project. I also have an architecture for my own models which I train on some basic data. I was wondering where you get your training data from.


I get the data from Huggingface

1 Like

Okay, I get most of my data from Huggingface too, I just was looking to see if you had a different source. Thanks!

1 Like

No problem! I find Huggingface more intuitive than other sources.

The hardest part of these projects is that I can’t fully test whether my script works until I pay for an expensive server. So sometimes I pay a bunch of money just to find out my script is faulty and doesn’t work.

Really? You can’t use Kaggle or Google Colab?

Sorry for the late response, it seems I missed your post:

It crashes because they aren’t powerful enough.

1 Like

No problem…
Have you tried Vast?
They have cheaper GPUs.
Use my link and I’ll get a commission if you sign up. I can guarantee that you will like it.

1 Like

What do you mean? Google can be a bit stingy with the free GPUs and will kick you for “inactivity” if you’re using them for training, but Kaggle will let you add an NVIDIA Tesla P100 to run a Jupyter notebook :laughing:

1 Like

Vast looks pretty interesting, thanks for sharing :laughing:

For anyone who’s curious, here’s their pricing page:

Kaggle won’t make a difference for him since he’s got so much data.