Recommendations for Validating Output of RAG System for Code Generation

Hi All,

I am building a Retrieval-Augmented Generation (RAG) system for code generation in an unfamiliar programming language. Given that I have a few prompts and their expected outputs, I am looking for recommendations on how to validate that the output from the RAG system closely matches the expected results.

Which method would you suggest for this purpose? Options I am considering include:

  • Abstract Syntax Trees (ASTs)
  • Vector embedding similarity
  • Others?
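
To illustrate the AST option: the sketch below compares two snippets by parsed structure rather than by text, so formatting differences are ignored. It uses Python's own `ast` module as a stand-in; for an unfamiliar target language you would need a parser for that language (e.g. a tree-sitter grammar) instead.

```python
# AST-based comparison sketch, using Python as a stand-in language.
import ast

def normalized_dump(source: str) -> str:
    """Parse source and dump the AST, discarding formatting details."""
    tree = ast.parse(source)
    # ast.dump() ignores whitespace, comments, and redundant parentheses,
    # but identifier names and structure still matter.
    return ast.dump(tree)

def structurally_equal(generated: str, expected: str) -> bool:
    """True if two snippets parse to identical ASTs."""
    return normalized_dump(generated) == normalized_dump(expected)

print(structurally_equal("x = 1 + 2", "x = (1 + 2)"))  # True: same AST
print(structurally_equal("x = 1 + 2", "y = 1 + 2"))    # False: names differ
```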

Your insights would be greatly appreciated.

Thank you!

Out of interest, why is RAG necessary at all here?

Isn’t the LLM trained on this language?

Yes, but because it's an unfamiliar programming language, the model doesn't have enough training data on it.


Out of curiosity, what is the language?

If the model doesn’t have enough data in its training corpus to devise a proper probability distribution for predicting the next token, one measure to mitigate this issue is to provide matching examples as guidelines. In short, we need to provide properly working and annotated code/language so the LLM can effectively utilize its pattern recognition abilities.

This can be implemented in two ways: either use a second model call to check whether the previous output conforms to the language's syntax, or provide a set of examples the model can draw from to produce the output in the first place.

The latter method is more intuitive as it resembles a standard few-shot prompting approach, but it doesn’t work well when the model has a high probability for a specific order of tokens that unfortunately doesn’t match our requirements. The approach with two model calls can produce more robust outputs but is less time and cost-efficient.
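
The two-call approach can be sketched roughly as below. `call_model` is a placeholder for whatever LLM client you use (e.g. a chat-completions call); no real API is invoked here, and the prompts are illustrative.

```python
# Hedged sketch of the generate-then-review pattern described above.
from typing import Callable

def generate_with_review(
    prompt: str,
    reference_snippet: str,
    call_model: Callable[[list], str],  # placeholder for your LLM client
) -> str:
    # First call: produce code, grounded by a known-good reference sample.
    draft = call_model([
        {"role": "system",
         "content": "Write code in the target language. Reference sample:\n"
                    + reference_snippet},
        {"role": "user", "content": prompt},
    ])
    # Second call: review the draft against the same reference and fix
    # any syntax that does not conform.
    return call_model([
        {"role": "system",
         "content": "Fix any syntax in the code below that does not match "
                    "this reference:\n" + reference_snippet},
        {"role": "user", "content": draft},
    ])
```

This trades an extra round-trip (time and cost) for robustness, matching the trade-off noted above.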

For example, in the Godot 4 game engine there has been a breaking change: a particular method no longer takes arguments that were required in Godot 3.

Simplified example:

var velocity = 1

# Godot 4
move_and_slide()

# Godot 3
move_and_slide(velocity)

All GPT models have considerably more training data for the third version of the engine, so the output will regularly contain the additional velocity argument in the function call to move_and_slide.

Either call the model again to check if there is an error to be fixed, or provide a working sample script as a reference. In my experience, when outdated training data regularly overrides user instructions, a second call for corrections works better.

If there is almost no training data, then providing an example works better. The model is quite adept at implementing a new language given enough examples, but you may encounter cases where the examples don’t properly cover all scenarios.

What's very important for best results is to provide the examples that best match what you are trying to achieve in each specific situation. You could drop in the complete documentation and every example each time, but this is costly, and if the model has to pull knowledge from several places to produce one answer, the output quality will not be as high as a simple needle-in-a-haystack benchmark implies. Splitting the task into several less complex steps therefore means shorter, more relevant examples and instructions.
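
Selecting the best-matching examples is typically done with embedding similarity. The sketch below assumes you have already embedded your example snippets with some embedding model; the vectors are made-up placeholders standing in for real embeddings.

```python
# Minimal cosine-similarity retrieval sketch for picking the most
# relevant examples for a given query.
import numpy as np

def top_k_examples(query_vec, example_vecs, examples, k=2):
    """Return the k examples whose embeddings are most cosine-similar."""
    q = query_vec / np.linalg.norm(query_vec)
    m = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity per example
    best = np.argsort(scores)[::-1][:k]  # indices, most similar first
    return [examples[i] for i in best]

examples = ["snippet A", "snippet B", "snippet C"]
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # placeholder embeddings
query = np.array([1.0, 0.05])
print(top_k_examples(query, vecs, examples))  # ['snippet A', 'snippet B']
```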

Regardless of the technique you use, you will need to provide a relatively large set of working code or documentation for the model to either write in the new language or fix its faulty outputs.


I largely agree with the above opinion.

It would be beneficial to provide the actual syntax and possible arguments of the code as a system message.

Additionally, it might be helpful to present code samples through RAG as needed.
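
Putting those two suggestions together, the message layout might look like the sketch below: the syntax/argument reference goes in the system message, and retrieved code samples are appended per request. All names and prompt wording here are illustrative; adapt them to whatever chat API you use.

```python
# Sketch: syntax reference in the system message, RAG samples in the user turn.
def build_messages(syntax_reference: str, retrieved_samples: list, user_prompt: str) -> list:
    system = (
        "You write code in the target language.\n"
        "Language syntax and allowed arguments:\n" + syntax_reference
    )
    context = "\n\n".join(
        f"Example {i + 1}:\n{s}" for i, s in enumerate(retrieved_samples)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user",
         "content": f"Relevant examples:\n{context}\n\nTask: {user_prompt}"},
    ]

msgs = build_messages(
    "move_and_slide() takes no arguments in Godot 4.",
    ["func _physics_process(delta):\n    move_and_slide()"],
    "Make the player move left.",
)
```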