I’m having difficulty finding the size of the data used to train GPT-3. Searches return wildly divergent answers, anywhere from 570GB to 45TB. The paper “Language Models are Few-Shot Learners” would seem to be the definitive source. The largest component of the training data was CommonCrawl, which “. . . was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens.” That explains the divergent answers I’m seeing. But CommonCrawl was not the only data source, and a corresponding table indicates that it was only 60% of the data used. That gives me 570GB / 0.6 = 950GB for the training set. Am I missing anything?
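Here is the arithmetic I’m doing as a quick sanity check, assuming the 60% in the table is a share of the filtered data size (if it’s actually a sampling weight during training, my estimate could be off):

```python
# Back-of-envelope estimate of the total filtered training data,
# using only the numbers quoted above: 570GB of filtered CommonCrawl,
# assumed to make up 60% of the total.
filtered_commoncrawl_gb = 570
commoncrawl_share = 0.60  # assumption: treating the 60% as a share of the data size

estimated_total_gb = filtered_commoncrawl_gb / commoncrawl_share
print(f"Estimated total filtered training data: {estimated_total_gb:.0f} GB")
# -> Estimated total filtered training data: 950 GB
```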
I ask because the GPT-3 model is reportedly around 800GB. Am I crazy for thinking it is bizarre that a statistical model would be only fractionally smaller than the data it was trained on?
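For what it’s worth, here is how I’d sanity-check that model-size figure myself. The 175 billion parameter count is from the paper; the bytes-per-parameter values are my own assumptions about storage precision, so treat this as a rough sketch rather than a statement of how OpenAI actually stores the model:

```python
# Rough estimate of GPT-3's on-disk size from its parameter count.
# 175 billion parameters is from the paper; the precisions below are assumptions.
n_params = 175e9

for label, bytes_per_param in [("fp16", 2), ("fp32", 4)]:
    size_gb = n_params * bytes_per_param / 1e9
    print(f"{label}: ~{size_gb:.0f} GB")
# fp16: ~350 GB
# fp32: ~700 GB
```

So the 800GB figure I keep seeing is at least in the same ballpark as a full-precision checkpoint, which is why the comparison to the ~950GB of training data strikes me as odd.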
I’m not any kind of expert on large language models or deep learning, so I’m happy to be enlightened.