Using embedding similarity for code generation

I’m trying to ‘teach’ GPT-4 to generate better code using a Python library (unfortunately it’s not doing great by default, probably due to the scarcity and low quality of code for that library found online in its training data).

The approach I’m using right now involves scanning for code listings and generating embeddings from them, then comparing those with an embedding generated from the user prompt so I can identify the most similar snippets to append to the system prompt.
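Very roughly, the pipeline looks something like this (just a sketch, assuming the current openai Python client; the embedding model name and the snippet list are placeholders, not my real data):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    """Return the embedding vector for a piece of text or code."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder code listings scanned from the library's docs/examples.
snippets = [
    "def load_model(path): ...",
    "def run_inference(model, data): ...",
]
snippet_vectors = [embed(s) for s in snippets]

def top_k(user_prompt: str, k: int = 3) -> list[str]:
    """Rank the stored snippets by cosine similarity to the user prompt."""
    q = embed(user_prompt)
    ranked = sorted(zip(snippets, snippet_vectors),
                    key=lambda sv: cosine(q, sv[1]), reverse=True)
    return [s for s, _ in ranked[:k]]
```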

Anyway, I have a question: for one of those embeddings generated with OpenAI’s get_embedding in Python, does it make sense to generate the embedding from an entire code snippet that could span multiple Python functions and several top-level statements? Or is it better to (somehow) extract the individual functions, group the out-of-function statements into smaller blocks (e.g. “this block of code tries to do X”), generate a description of what each piece is trying to do, and finally generate the embedding from that description?

The approach I’m using right now is yielding poor cosine similarity scores, and because poorly relevant snippets end up in the system prompt, the quality of the generated code is also rather poor (and sometimes hallucinated).

Thanks for any help.

Getting good quality code from embeddings is a tricky business: similarity does not equal functionality, and code that is semantically similar can give very different results.

It does “work”, just not very well; maybe others here have had more luck. I wish you well.

I’d be very curious to know if someone has had any luck improving the accuracy of the generated code from a series of snippets somehow. Does embedding a short description like ‘this snippet does X’, or the function names, bring any better accuracy than just embedding the entire code file or a long snippet?
I could go ahead and just try it out, but since it’s a lot of work, I’d rather not dive down another rabbit hole and would prefer advice from someone who’s tried this before (and maybe chose another path altogether).

Yes, I understand where you are coming from. Fortunately, embeddings are now ridiculously cheap to do, so that is less of a barrier.

The issue, as I see it, is that the power that goes into a model like GPT-4, and the high-quality code it can produce, is simply missing from the lower-order models used for embedding. Perhaps some combination of code snippet retrievals based on a multi-part prompt would help: first ask the AI, given the request prompt XYZ from the user, what the ideal vector database retrieval prompt would be; then use that to pull back the relevant snippets; then push the top, let’s say, 3 that get returned into a new prompt which contains the retrievals as context, along with an explanation that the above is vector-retrieved context; and finally specify that the user’s original request is XYZ and that the model should rely heavily on the context to produce its new response.

Maybe… that gives the GPT-4-class model enough context to build an accurate solution… worth a shot.
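Something like this, roughly (just a sketch; the model name and the top_k() retrieval helper from the earlier sketch are assumptions on my part):

```python
from openai import OpenAI

client = OpenAI()

def answer_with_context(user_request: str) -> str:
    # Step 1: ask the model for an ideal vector-database retrieval prompt.
    retrieval_prompt = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Turn the user's coding request into a short search query for a vector database of code snippets."},
            {"role": "user", "content": user_request},
        ],
    ).choices[0].message.content

    # Step 2: pull back the top 3 snippets (top_k() from the earlier sketch is assumed).
    context = "\n\n".join(top_k(retrieval_prompt, k=3))

    # Step 3: answer the original request, explaining that the context is vector-retrieved.
    answer = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "The following code snippets were retrieved from a vector database. Rely heavily on them as context:\n\n" + context},
            {"role": "user", "content": user_request},
        ],
    )
    return answer.choices[0].message.content
```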


Like Foxabilo mentioned, there are several paths you can take.

For the semantic part, documentation of the functions or of the code overall is important.
For the functional part, you can embed the AST of a file, class, method, globals, etc.

The granularity is up to you. Consider chunking and overlapping for this.


@PriNova could you please explain the AST, chunking and overlapping a bit better? I didn’t get the idea.
I was going to store the entire snippet in the database along with its calculated embedding, so as soon as a similarity is detected for that embedding, I can just copy and paste the snippet’s code into the system prompt. What’s the AST + chunking and overlapping approach?

Ok, I did some reading: by ‘embedding the AST’ do you mean using a parser for the Python files and using the text of the function/class/interface names and their (potentially present) documentation comments to generate the embeddings?

But in that case, why would I use chunking? I don’t expect those to be too long (since I’m not generating embeddings from the code inside those functions/classes/interfaces). Or did I misunderstand your intent? I also don’t see how overlap applies with this approach (I’m assuming most of the functions aren’t duplicates anyway).

Maybe my reply yesterday was not clear enough (late-night replies suffer in coherence; sorry for that).

For the semantic part, to embed your codebase you need to break it down into chunks, e.g. 1000 tokens, words, or characters. Some embedding frameworks provide an overlapping feature so that consecutive chunks share some content at their beginning and end, which helps preserve context across chunk boundaries. Code with comments in it will be more effective.
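As a rough example, a character-based chunker with overlap could look like this (the sizes are arbitrary; token-based chunking would need a tokenizer such as tiktoken):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters with their neighbours.

    chunk_size must be larger than overlap, otherwise the loop never advances.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```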

For the functional part, you can translate your codebase into an Abstract Syntax Tree (AST) and then store that AST as an embedding. Here it depends on the granularity of the AST with regard to scope: you can store the AST for a complete file, class, function, or global variables, or for an execution path, branching path, or iteration path (loops, etc.).
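As a sketch of that granularity idea, using only the standard-library ast module (whether you then embed the raw source of each unit or a summary of it is up to you):

```python
import ast

def extract_units(source: str) -> list[dict]:
    """Collect function/class-level units (name, docstring, source) from a Python file."""
    tree = ast.parse(source)
    units = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            units.append({
                "name": node.name,
                "docstring": ast.get_docstring(node) or "",
                # Requires Python 3.8+ for ast.get_source_segment.
                "source": ast.get_source_segment(source, node) or "",
            })
    return units
```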

With that granularity and generalization as embeddings, you can find code similarities and code duplications, or simply build, in an intelligent way, a library of components for reuse. And in your case, find similar snippets to teach GPT.

The semantic embeddings can be queried with simple prompts by you or the user; the functional embeddings are a kind of internal reflection (codebase indexing and retrieval) of the codebase and can also be queried by the user in combination with the semantic embeddings.

Or take some kind of hybrid approach to combine both worlds.

I hope this clarifies and fits with your use case.

Got it, so maybe a ‘hybrid’ approach? I.e. encode the code snippets as class/function/interface name + parameter names + docstrings as a ‘syntactic’ embedding, and then use code2seq or the like to generate embeddings based on their AST paths (to capture the ‘semantic’ meaning as well). Then, whatever the user prompts, I can generate an embedding from their prompt (whether a textual description or code) and see if I get good similarity results for relevant code snippets. Does this make any sense?
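Concretely, I’m imagining the ‘syntactic’ text per unit being built something like this (just a sketch; the functional/code2seq embedding would be produced separately and its similarity score blended in):

```python
import ast

def syntactic_text(node: ast.FunctionDef) -> str:
    """Build the text to embed for one function: name + parameter names + docstring."""
    params = ", ".join(arg.arg for arg in node.args.args)
    doc = ast.get_docstring(node) or ""
    return f"{node.name}({params})\n{doc}"
```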


Last question @PriNova: even if I generate the embeddings from sequences produced from the codebase ASTs with code2seq, will this still work to find relevant snippets based on the user input? That approach doesn’t take into account any function documentation, comments or anything else. I fear that the similarity results are going to be meaningless if the user asks “write code that does X” instead of entering code that can also be code2seq’d, embedded, and similarity-compared.