Love to read more language-related stats, etc, on codex-davinci's training data!

I found the July paper a great read, but it seems to have been written when the model was trained solely on Python:

"Our training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters. After filtering, our final dataset totaled 159 GB."

Now that JavaScript, TypeScript, Haskell, etc. can all be generated (e.g. by codex-davinci), I've been looking around the docs for something like a breakdown of the training data by language: how much JavaScript vs. Python, and so on.

Would appreciate any pointers on where I can learn more about this!

I'd also welcome more info on the training data, like:

  • Other than GitHub repos, are docstrings or serialized Stack Overflow Q&A part of the training data?

  • Looking at the training data, how much can we infer that codex-davinci's task capabilities are (somewhat) transferable across languages? E.g. when exposed to enough JavaScript and Python, can davinci replicate in JavaScript the things it learns to do from HumanEval and other Python code?