If the model doesn’t have enough data in its training corpus to form a reliable probability distribution for predicting the next token, one way to mitigate this is to provide matching examples as guidelines. In short, we need to supply correct, annotated code or language samples so the LLM can apply its pattern-recognition abilities effectively.
These examples can be implemented in two ways: either by using a second model call to check if the previous output conforms to the syntax of the language, or by providing a set of examples the model can draw from to produce the output in the first place.
The latter method is more intuitive, as it resembles a standard few-shot prompting approach, but it breaks down when the model assigns high probability to a specific token sequence that unfortunately doesn’t match our requirements. The two-call approach can produce more robust outputs but is less time- and cost-efficient.
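The two-call variant can be sketched roughly like this. Here `model` stands in for whatever client wrapper you use (OpenAI SDK, local server, etc.); the function signature and the prompt wording are illustrative assumptions, not a real API:

```python
# A minimal sketch of the two-call pattern: the first call drafts the
# code, the second call reviews the draft for syntax errors. `model` is
# a placeholder for your LLM client wrapper -- not a real library call.
from typing import Callable

def generate_with_review(task: str, model: Callable[[str], str]) -> str:
    draft = model(f"Write Godot 4 GDScript for this task:\n{task}")
    review_prompt = (
        "Check the following GDScript for Godot 4 syntax errors "
        "(for example, move_and_slide() must be called without arguments) "
        "and return the corrected script only.\n\n" + draft
    )
    return model(review_prompt)
```

The extra call roughly doubles latency and cost, which is the trade-off discussed above.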
For example, the Godot 4 game engine introduced a breaking change: a particular method that previously took arguments no longer accepts any; the required values are now read from a node property instead.
Simplified example:
var velocity = Vector2(100, 0)

# Godot 4 – velocity is a built-in property, so no argument is passed
move_and_slide()

# Godot 3 – the velocity must be passed as an argument
move_and_slide(velocity)
All GPT models have considerably more training data for the third version of the engine, so the output will regularly contain the additional velocity argument in the function call to move_and_slide.
To work around this, either call the model again to check for errors to fix, or provide a working sample script as a reference. In my experience, when wrong training data regularly overrides user instructions, a second corrective call works better.
If there is almost no training data, then providing an example works better. The model is quite adept at implementing a new language given enough examples, but you may encounter cases where the examples don’t properly cover all scenarios.
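The few-shot alternative can be sketched as a prompt that embeds a known-good Godot 4 snippet for the model to imitate. The reference script below is a minimal CharacterBody2D example, and the prompt wording is an illustrative assumption, not a tested recipe:

```python
# Sketch of the few-shot approach: embed a working Godot 4 snippet in
# the prompt so the model copies the current API instead of the Godot 3
# pattern it saw more often in training.

GODOT4_REFERENCE = """\
extends CharacterBody2D

func _physics_process(delta):
    velocity.y += 980.0 * delta  # velocity is a built-in property
    move_and_slide()             # Godot 4: no arguments
"""

def build_few_shot_prompt(task: str) -> str:
    return (
        "Below is a working Godot 4 script. Follow its API exactly, "
        "including calling move_and_slide() without arguments.\n\n"
        f"{GODOT4_REFERENCE}\n"
        f"Now write Godot 4 GDScript for this task:\n{task}"
    )
```

A single well-chosen reference script like this is usually enough to steer the model away from the outdated call signature, at the cost of a longer prompt.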
For best results, provide the examples that most closely match what you are trying to achieve in the specific situation. You could drop in the complete documentation and every example each time, but this is costly, and if the model has to pull knowledge from several places to produce one answer, the output quality will not be as high as a simple needle-in-a-haystack benchmark implies. Splitting the task into several less complex steps therefore means shorter, more relevant examples and instructions.
Regardless of the technique you use, you will need to provide a relatively large set of working code or documentation for the model to either write in the new language or fix its faulty outputs.