Hi, I know that OpenAI’s text embeddings measure the relatedness of text.
I am new to this field, so this question is probably trivial for some of you. Anyway, I was wondering whether it's possible to use this technique with source code.
I was trying to figure out a way to analyse source code, and given the token limit, this could be one way to capture prior knowledge.
For example, if I have a list of source files, I could search for similarities within the list.
Any advice? Is it possible, or am I just blathering on?
I think it would be worth testing. I haven’t seen anything like this yet. But if the code also included comments, I think it would be even more likely to work.
Depending on how you’re trying to use them, I recommend trying to design a small experiment. My bet is that it will at least work somewhat. But how well it works, and whether that’s good enough, I’m not sure.
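One way to sketch such an experiment: embed a handful of code snippets, then check whether nearest-neighbour search by cosine similarity surfaces related code. This is only a sketch, assuming the `openai` Python package and the `text-embedding-3-small` model; the helper names are my own, and the similarity math works with any embedding vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(texts):
    """Embed a batch of code snippets via the OpenAI API.

    Assumes the `openai` package is installed and OPENAI_API_KEY is set.
    """
    from openai import OpenAI

    client = OpenAI()
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]


def most_similar(query_vec, snippet_vecs):
    """Return (index, score) of the snippet vector closest to the query."""
    scores = [cosine_similarity(query_vec, v) for v in snippet_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

For a first test, you could embed a few functions plus a natural-language query like "parse a CSV file" and see whether the nearest snippet is plausible. That would at least tell you if the signal is there before building anything bigger.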
One big question is: how can I store the various source files in a dataset? I just have a big list of source files, but they are not structured. They simply contain descriptions and comments, and they are different files.
Sorry if the question seems to be trivial. Thanks for your patience!
I think you’re asking about how to format the inputs for generating the embeddings.
- parse the code into logical code blocks
- include the file path at the beginning of each embed input
- if the code block is part of a class/module, I would include that, too
Designing embed inputs, imo, and at this point in time, is more of an art than a science. So you'll have to try things and improve with the lessons you learn, which could include lessons that are specific to your situation.
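The formatting steps above could be sketched roughly like this. The field labels and function name are my own invention, not an established convention, so treat this as one possible starting point to iterate on:

```python
def build_embed_input(file_path, code_block, enclosing=None):
    """Assemble one embedding input: file path, optional enclosing
    class/module, then the parsed code block itself."""
    parts = [f"# File: {file_path}"]
    if enclosing:
        # e.g. "class UserRepository" or "module payments"
        parts.append(f"# In: {enclosing}")
    parts.append(code_block)
    return "\n".join(parts)
```

You would run your parser first to split each file into logical blocks, then call this once per block before sending the strings off to the embedding endpoint.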