Hi, I know that OpenAI’s text embeddings measure the relatedness of text.
I am new to this field, so this question is probably trivial for some of you. Anyway, I was wondering whether it's possible to use this technique with source code.
I was trying to figure out a way to analyse source code, and given the token limit, this could be one way to capture prior knowledge.
For example, if I have a list of source files, I could search for similarities within the list.
Any advice? Is it possible, or am I just blathering on?
I think it would be worth testing. I haven’t seen anything like this yet. But if the code also included comments, I think it would be even more likely to work.
Depending on how you’re trying to use them, I recommend trying to design a small experiment. My bet is that it will at least work somewhat. But how well it works, and whether that’s good enough, I’m not sure.
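One way to sketch such an experiment: embed a handful of code snippets, then check whether nearest-neighbour search by cosine similarity surfaces related code. This is only a sketch, assuming the `openai` Python package and the `text-embedding-3-small` model; the helper names are my own, and the similarity math works with any embedding vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts):
    """Embed a batch of code snippets via the OpenAI API.

    Assumes the `openai` package is installed and OPENAI_API_KEY is set.
    """
    from openai import OpenAI

    client = OpenAI()
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]


def most_similar(query_vec, snippet_vecs):
    """Return (index, score) of the snippet vector closest to the query."""
    scores = [cosine_similarity(query_vec, v) for v in snippet_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

For a first test, you could embed a few functions plus a natural-language query like "parse a CSV file" and see whether the nearest snippet is plausible. That would at least tell you if the signal is there before building anything bigger.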
One big question is: how can I store the various source files in a dataset? I just have a big list of source files, but they are not structured. They simply contain descriptions and comments, and they are different files.
Sorry if the question seems to be trivial. Thanks for your patience!
I think you’re asking about how to format the inputs for generating the embeddings.
- parse the code into logical code blocks
- include the file path at the beginning of each embed input
- if the code block is part of a class/module, I would include that, too
Designing embed inputs, imo, and at this point in time, is more of an art than a science. So you'll have to try things and improve with the lessons you learn, which could include lessons that are specific to your situation.
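The formatting steps above could be sketched roughly like this. The field labels and function name are my own invention, not an established convention, so treat this as one possible starting point to iterate on:

```python
def build_embed_input(file_path, code_block, enclosing=None):
    """Assemble one embedding input: file path, optional enclosing
    class/module, then the parsed code block itself."""
    parts = [f"# File: {file_path}"]
    if enclosing:
        # e.g. "class UserRepository" or "module payments"
        parts.append(f"# In: {enclosing}")
    parts.append(code_block)
    return "\n".join(parts)
```

You would run your parser first to split each file into logical blocks, then call this once per block before sending the strings off to the embedding endpoint.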