…[SNIP]… Following the methods outlined above, the proposed 70B-parameter Chinchilla consistently and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B). The researchers also found that, despite using different fitting procedures and different sets of trained models, the three approaches produce comparable predictions for how the optimal parameter count and token count scale with FLOPs.
Overall, this research contributes to developing an effective training paradigm for large auto-regressive language models under a limited compute budget. Standard practice has been to increase model size without a matching increase in the number of training tokens. The team instead recommends doubling the number of training tokens for every doubling of model size. This means that using larger, higher-quality training datasets can lead to better results on downstream tasks. [PAPER] [SOURCE]
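To make the "double the tokens when you double the model" rule concrete, here is a minimal Python sketch. It rests on assumptions that are not stated in this article: the reference point of 70B parameters trained on 1.4 trillion tokens is Chinchilla's configuration as reported in the paper, and the compute estimate uses the common rule of thumb that training FLOPs are roughly 6 × parameters × tokens. The function names are made up for illustration.

```python
# Illustrative sketch only; not code from the paper.
# Assumed reference point: Chinchilla, 70B parameters on 1.4T tokens.
REF_PARAMS = 70e9
REF_TOKENS = 1.4e12


def scale_tokens(new_params: float,
                 ref_params: float = REF_PARAMS,
                 ref_tokens: float = REF_TOKENS) -> float:
    """Scale training tokens in direct proportion to model size."""
    return ref_tokens * (new_params / ref_params)


def approx_train_flops(params: float, tokens: float) -> float:
    """Rough training cost, assuming FLOPs ~= 6 * params * tokens."""
    return 6.0 * params * tokens


if __name__ == "__main__":
    for params in (70e9, 140e9, 280e9):
        tokens = scale_tokens(params)
        flops = approx_train_flops(params, tokens)
        print(f"{params:.2e} params -> {tokens:.2e} tokens, ~{flops:.2e} FLOPs")
```

Under this rule, each doubling of model size doubles the token count, so the estimated training compute grows by about four times. That is why, for a fixed compute budget, a smaller model trained on more data can be the better choice.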
Woah, what a time to be alive!