I'd love to read more language-related stats, etc., on codex-davinci's training data!

I found the July paper to be a great read, but it seems to have been written from the perspective of a model trained entirely on Python.

Our training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters. After filtering, our final dataset totaled 159 GB.
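
For context, the filters they describe seem simple enough to reproduce. Here's a rough sketch of how I imagine they work (the alphanumeric threshold is my own guess, and I've skipped the auto-generated-file check since the paper doesn't say how it was done):

```python
from pathlib import Path

def keep_file(path: Path, max_size: int = 1_000_000) -> bool:
    """Rough reconstruction of the Codex paper's file filters (thresholds partly guessed)."""
    text = path.read_text(errors="ignore")
    if len(text.encode("utf-8")) > max_size:  # only files under 1 MB
        return False
    lines = text.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    alnum_ratio = sum(c.isalnum() for c in text) / max(len(text), 1)
    # avg line length <= 100, max line length <= 1000, not mostly non-alphanumeric
    return avg_len <= 100 and max_len <= 1000 and alnum_ratio > 0.25  # 0.25 is a guess
```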

Now that JavaScript, TypeScript, Haskell, etc. are all generatable (e.g. by codex-davinci), I've been looking around in the docs for something like a breakdown of how much of the training data is JavaScript vs. Python, and so on.

Would appreciate any pointers on where I can learn more about this!

As well as more info on the training data, like:

  • other than GitHub repos, are docstrings / serialized Stack Overflow Q&A part of the training data?

  • looking at the training data, how much can we infer about whether codex-davinci's task capabilities are (somewhat) transferable across languages? E.g., when exposed to enough JavaScript & Python, can davinci replicate in JavaScript the things it learns to do on HumanEval and other Python code?

Thanks!