The company I work for uses a proprietary scripting language to implement policy. I’m interested in training GPT to understand this language for a few purposes:
- Write custom code with a natural language prompt.
- Interpret existing code and explain its purpose/behavior to users.
- Troubleshoot by interpreting code and explaining the cause of unexpected behavior.
- Act as a resource for users trying to learn the language.
Currently, I’m working on building a training dataset along some of the following lines (a sketch of the record format I have in mind follows this list):
- Prompt-completion pairs for each function and procedure in the language. I may create each pair in both directions (e.g. “Which function does X?” and “What does the X function do?”).
- Prompts regarding syntax of common programming structures (e.g. LOOPs)
- Prompts regarding how to create functions, call them, pass arguments, etc.
- Prompts regarding data types, supported operators, etc.
- Prompts asking for code to solve relatively simple use cases, with completions containing only code and explanations in comments.
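For concreteness, here’s roughly the shape I’m picturing for those pairs, assuming the prompt-completion JSONL format that OpenAI fine-tuning expects. The function name and descriptions below are made-up placeholders, not from my actual language; the separator and stop-sequence conventions are the ones suggested in the fine-tuning docs:

```python
import json

# Hypothetical pairs; APPLY_POLICY is a placeholder, not a real function.
# Prompts end with a fixed separator, and completions start with a space
# and end with a stop sequence, per the fine-tuning guide.
pairs = [
    {
        "prompt": "What does the APPLY_POLICY function do?\n\n###\n\n",
        "completion": " APPLY_POLICY evaluates a policy rule against the current record and returns PASS or FAIL. END",
    },
    {
        "prompt": "Which function evaluates a policy rule against a record?\n\n###\n\n",
        "completion": " APPLY_POLICY. END",
    },
]

# One JSON object per line (JSONL)
with open("training_data.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```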
I’m looking for any advice on how to accomplish this task. Some questions I have in mind…
- What model should I use?
- I assume fine-tuning is the right approach here, but if there’s a use for embeddings, I’d like to understand it better. I’m still struggling a bit to understand the use cases for each (I’ve sketched my current understanding of the embeddings approach after this list).
- How should I format the training data? Can I use markdown in the completions?
- Anything you anticipate I might not be considering that I should?
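To make the fine-tune vs. embeddings question concrete, here’s my rough understanding of the embeddings (retrieval) approach: embed snippets of the language documentation, find the snippet most similar to a user’s question, and paste it into the prompt. A minimal sketch, assuming the pre-1.0 openai Python SDK and made-up doc snippets:

```python
import numpy as np
import openai

# Hypothetical documentation snippets for the language
docs = [
    "APPLY_POLICY(rule, record): evaluates a policy rule against a record.",
    "LOOP ... END LOOP: iterates over the items in a collection.",
]

def embed(texts):
    # text-embedding-ada-002 via the pre-1.0 SDK; returns one vector per input
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [item["embedding"] for item in resp["data"]]

doc_vectors = np.array(embed(docs))

def top_match(question):
    q = np.array(embed([question])[0])
    # Cosine similarity between the question and each doc snippet
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return docs[int(np.argmax(sims))]

# The retrieved snippet would be prepended to the prompt sent to the model
print(top_match("How do I iterate over a collection?"))
```

If that’s roughly right, I’d appreciate guidance on when retrieval like this beats (or complements) fine-tuning for a niche language.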