I’m a long-term Object Pascal nut, and over the years I have been involved in Pascal compiler development on various levels.
As you surely are aware, Pascal is completely out of fashion these days. However, for well over 20 years, it used to be a dominant language. Most importantly, due to the system of “unit files” and the Component system, there has been (and still is) a gigantic market of self-contained software components for pretty much every programming task there is. Delphi is still in use for new projects, as are Lazarus and FreePascal (which are open-source clones of Delphi, the IDE, and Object Pascal, the language).
Even if your target audience is mostly people writing stuff in “hip” languages, I think it would be of great use to support Object Pascal in your model, even if it’s just for translating from Object Pascal to another language.
The great thing is that there are archives of all the components created within the last 25 years, most of them nicely curated, documented, and written in a language style that complies with the reference guide.
Today GPT3 (Chat-GPT3) is already pretty good at understanding and even writing Object Pascal. It has clearly scraped tons of Object Pascal repositories. Codex, however, isn’t. Most Object Pascal components do not sit on GitHub, but are available as zipped files in Object Pascal component directories.
Anyway: I’d be willing to curate and provide a couple billion lines of clean Object Pascal, for example in JSON Lines format. I am not sure how your model training works, but quite often those Object Pascal libraries come with manuals, which would provide additional context.
I’d also be willing to limit the selection of source code to a list of specific licenses - say BSD/MPL/Public Domain.
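To make that a bit more concrete, here is a rough sketch in Python of how such an export could look: one JSON object per source file, filtered by a license allowlist. The field names (`component`, `file`, `license`, `source`, `manual`), the sample records, and the helper `to_json_lines` are all made up for illustration; I don't know what format your training pipeline actually expects.

```python
import json

# Assumed allowlist, taken from the licenses named above.
ALLOWED_LICENSES = {"BSD", "MPL", "Public Domain"}


def to_json_lines(units, allowed=ALLOWED_LICENSES):
    """Serialize the units whose license is allowed, one JSON object per line."""
    lines = []
    for unit in units:
        if unit["license"] not in allowed:
            continue  # skip anything outside the allowlist
        # json.dumps escapes newlines in the source, so each record stays on one line
        lines.append(json.dumps(unit, ensure_ascii=False))
    return "\n".join(lines)


# Made-up sample records; the field names are illustrative only.
sample_units = [
    {
        "component": "ExampleGrid",
        "file": "examplegrid.pas",
        "license": "MPL",
        "source": "unit ExampleGrid;\ninterface\nimplementation\nend.",
        "manual": "ExampleGrid displays tabular data.",
    },
    {
        "component": "ClosedThing",
        "file": "closedthing.pas",
        "license": "Proprietary",
        "source": "unit ClosedThing;\ninterface\nimplementation\nend.",
        "manual": "",
    },
]

output = to_json_lines(sample_units)
print(output)
```

Keeping the manual text in the same record as the source is just one option; it could equally go into a separate file keyed by component name.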
Simon Kissel (aka scamp)