80% faster, 50% less memory, 0% loss of accuracy LLM finetuning

Hey OpenAI community! It’s been a while since I last posted so hi again!

I just launched Unsloth (GitHub - unslothai/unsloth: 80% faster, 50% less memory LLM finetuning), which lets you finetune LLMs 5x faster and use 50% less memory, all on your local GPU. It currently works for Llama models, so it doesn't apply to ChatGPT's finetuning, but I'm sharing everything I learnt along the way! It could be useful for making OpenAI's finetuning faster as well!

You can now finetune

  • 5x faster (5 hours to 1 hour)
  • Use 50% less memory
  • With 0% loss in accuracy
  • All locally on NVIDIA GPUs (Tesla T4, RTX 20/30/40, Ampere, Hopper) for free!

How? By:

  1. Hand-deriving all the backpropagation steps
  2. Optimizing the bracketing of matrix chain multiplications
  3. Writing all kernels in OpenAI's Triton language
  4. Reducing data movement via in-place operations
  5. And other maths and coding trickery!
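To illustrate point 2, here's a minimal sketch of why bracketing matters for a LoRA-style product X @ A @ B, where A and B are low-rank (the shapes and numbers below are my own illustrative choices, not from the post):

```python
# FLOP count (multiply-adds) for multiplying an (m, k) matrix by a (k, n) matrix.
def matmul_cost(m, k, n):
    return m * k * n

# Illustrative LoRA-style shapes: X is (tokens, hidden), A is (hidden, rank),
# B is (rank, hidden), with a small rank.
tokens, hidden, rank = 4096, 4096, 16

# Bracketing 1: (X @ A) @ B -- go through the low-rank bottleneck first.
cheap = matmul_cost(tokens, hidden, rank) + matmul_cost(tokens, rank, hidden)

# Bracketing 2: X @ (A @ B) -- materialises a full (hidden, hidden) product.
costly = matmul_cost(hidden, rank, hidden) + matmul_cost(tokens, hidden, hidden)

print(f"(X@A)@B: {cheap:,} multiply-adds")   # ~0.54 billion
print(f"X@(A@B): {costly:,} multiply-adds")  # ~69 billion
```

Both bracketings produce the same matrix, but the first is over 100x cheaper here; picking the bracketing per expression (and applying the same idea in the backward pass) is one source of the speedup.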

I wrote up a blog post about all the manual hand-derived backprop in Introducing Unsloth.
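As a toy illustration of what "hand-derived backprop" means (my own minimal example, not from the blog post): instead of relying on autograd, you write the gradient out by hand via the chain rule, then sanity-check it against a finite-difference estimate:

```python
# Toy model: prediction y = w * x, loss L = (y - t)^2, all scalars.
def loss(w, x, t):
    return (w * x - t) ** 2

# Hand-derived gradient via the chain rule: dL/dw = 2 * (w*x - t) * x.
def grad_w(w, x, t):
    return 2.0 * (w * x - t) * x

w, x, t = 0.5, 2.0, 3.0
analytic = grad_w(w, x, t)

# Finite-difference check that the hand derivation is correct.
eps = 1e-6
numeric = (loss(w + eps, x, t) - loss(w - eps, x, t)) / (2 * eps)

assert abs(analytic - numeric) < 1e-4
print(analytic)  # -8.0
```

The same check scales up to matrix-valued gradients, which is how you gain confidence that a hand-written backward kernel matches what autograd would have computed.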

I wrote a Google Colab notebook for the Tesla T4 which finetunes Alpaca 2x faster on a single GPU.

On Kaggle, via 2 Tesla T4s with DDP (Unsloth - LAION Chip2 Kaggle), you can finetune LAION's OIG 5x faster and Slim Orca 5x faster.

You can install Unsloth locally via pip. For CUDA 11.8:

pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"

For CUDA 12.1:

pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"