How to fine-tune gpt-3.5-turbo to give it new knowledge?

I have a GitHub repo with docs, code, etc. that I want GPT to know. Currently it doesn’t, and the entire code base doesn’t fit in its context window. The entire code base doesn’t fit in Claude’s context either. Therefore I want to fine-tune / train GPT on this new code base.

But the data has to be given as conversation prompts. I thought I could just upload the text files and OpenAI would train on the data. It’s not scalable for me to demonstrate the responses I want through example conversation messages.
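To make that concrete, every single training example has to be a chat-format record along these lines (a rough sketch with made-up content):

```python
import json

# One chat-format training example of the kind OpenAI's fine-tuning endpoint expects,
# written out as a single JSONL line. The contents here are made up for illustration.
example = {
    "messages": [
        {"role": "system", "content": "You are an expert on my repo."},
        {"role": "user", "content": "How does the parser module handle imports?"},
        {"role": "assistant", "content": "It walks the AST and ..."},
    ]
}
print(json.dumps(example))
```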

Thus, what is the recommended way to fine-tune GPT for my use case?

Perhaps the better way is to fine-tune a LLaMA model myself, then fine-tune it further on instruction-following data with SFT/DPO or something.


You want Retrieval-Augmented Generation (RAG) with text embeddings, not fine-tuning.


Welcome to the community, @brandojazz!

I agree with @anon22939549.

Fine-tuning is recommended to save on prompt tokens for a high volume of calls for specific use case(s) or to set model behavior.

For knowledge augmentation and retrieval, embeddings are the go-to approach.

Here’s the documentation on embeddings.

Currently, GPT models have a finite context length, which limits the number of tokens (prompt + completion) they can handle.

To overcome this, you can pass an outline of your repository to the model and give the model access to “see” the code, similar to how Advanced Data Analysis does on ChatGPT or how open-interpreter does locally.

Once the model selects the file(s) to be used with RAG, you can obtain the embeddings and look for semantically similar chunks to be passed as context.
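For illustration, the core retrieval step can be as simple as this (a minimal sketch assuming the openai>=1.0 Python client and text-embedding-ada-002; a production setup would use a proper vector store instead of an in-memory list):

```python
# Embed the code/doc chunks once, then embed the query and pick the most
# similar chunks to pass to the model as context.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = ["def parse(...): ...", "The config loader reads settings.toml ..."]  # your chunks
chunk_vecs = embed(chunks)

def top_k(query: str, k: int = 3) -> list[str]:
    q = embed([query])[0]
    # cosine similarity against every chunk
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

context = "\n\n".join(top_k("How is the config file loaded?"))
```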


Why do you think I want RAG? Is it because once it’s fine-tuned it will lose the chat capabilities and has to be re-fine-tuned? This is my assumption, given that OpenAI forces you to provide the fine-tuning data in chat / instruction-following format.

If I had access to the model I could do DPO to it myself.

Perhaps the best solution would be RAG + fine-tuning on the base model. With OpenAI being closed about what they actually do, perhaps it’s not as good an idea as I naively think.


Based on my limited experience with fine-tuning on the OpenAI platform, you’re correct that fine-tuning will degrade the model’s performance when your prompts do not reflect those included in the training set.

I have an idea built on your notion of RAG + fine-tuning which you might find interesting. I’ve read a number of articles demonstrating that a fine-tuned model trained on domain-specific question-answer pairs does not work as effectively as vanilla RAG.

However, I do believe that fine-tuning can contribute towards creating an extremely powerful retrieval system.

Most RAG systems you see work by embedding the user’s query (generally a question) and performing a similarity search against your knowledge base, aiming to find the chunk of text that contains the answer to the user’s prompt.

This is very effective in most cases, especially if the knowledge base is not overly large.

The reason RAG beats a fine-tune for question answering is that fine-tuning generally helps with tone and style, not with adding knowledge to the model. I suppose it’s possible, with lots of testing, to develop a fine-tuned model that does effectively add knowledge, but it’s extremely difficult to balance/optimize (avoiding overfitting, evals, etc.). It would also be difficult to create an all-encompassing dataset of examples to use in fine-tuning.

I do believe however that the best RAG system given the OpenAI tools we can access at the moment does include a fine-tuned model.

The function of the fine-tune is not to answer questions directly, but rather to generate “synthetic” responses which are used strictly as the text we embed and compare against the document chunks in your knowledge base.

Allow me to support this idea:
Considering that GPT-4 + RAG is extremely accurate (with the right prompt structure), there is no problem to solve as long as the correct context is passed to the model.
Therefore, the main potential issue is a failure of the embedding step to properly match the user’s prompt to the correct chunk in your knowledge base; if the proper chunks are not passed into GPT-4’s context, it won’t answer correctly.

Using a fine-tuned model to produce synthetic strings for the similarity search lets us create a very close semantic match to the correct chunk in the knowledge base (the same material the model saw during fine-tuning). Instead of matching the user’s question directly against the knowledge base, we embed a string the fine-tuned model generates from that question, which matches the tone and style of the content in the knowledge base.

So, in this method, we use the best aspects of RAG and fine-tuning to develop the ultimate retrieval system.
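As a rough sketch of what that could look like with the current API (the fine-tune id is a placeholder for your own job, and text-embedding-ada-002 is just my default choice):

```python
# Embed a synthetic, docs-styled answer produced by the fine-tuned model, rather
# than the raw user question, and use that vector for the similarity search.
import numpy as np
from openai import OpenAI

client = OpenAI()
FT_MODEL = "ft:gpt-3.5-turbo:your-org::abc123"  # hypothetical fine-tune id

def synthetic_answer(question: str) -> str:
    resp = client.chat.completions.create(
        model=FT_MODEL,
        messages=[
            {"role": "system", "content": "Answer in the tone and style of the project docs."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
    return np.array(resp.data[0].embedding)

def best_chunk(question: str, chunks: list[str], chunk_vecs: np.ndarray) -> str:
    q_vec = embed(synthetic_answer(question))   # search with the synthetic answer
    sims = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
    return chunks[int(np.argmax(sims))]          # this chunk goes into GPT-4's context
```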

In order to develop the dataset for this, I would recommend feeding chunks of your codebase/documentation to GPT-4 and saying, “Generate 5 questions to which this chunk of information is the answer”, and fine-tune from there.
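The data-generation loop could look roughly like this (a sketch; the prompt wording and models are just what I’d try first, and here the assistant target is the chunk itself so the model learns the knowledge base’s tone):

```python
# Ask GPT-4 for questions per chunk, then write (question -> chunk) pairs as
# chat-format JSONL for fine-tuning. Prompt wording and model choice are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def questions_for(chunk: str, n: int = 5) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
                   f"Generate {n} questions to which this text is the answer, one per line:\n\n{chunk}"}],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

chunks = ["..."]  # your documentation/code chunks
with open("train.jsonl", "w") as f:
    for chunk in chunks:
        for q in questions_for(chunk):
            f.write(json.dumps({"messages": [
                {"role": "user", "content": q},
                {"role": "assistant", "content": chunk},
            ]}) + "\n")
```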

Here’s a simple example that conveys the retrieval idea… however, the benefit of this system likely only emerges with more complex tasks.

  • User: “amzn founding year”
  • Synthetic GPT: “Amazon, the company, was founded in 1995.”

Knowledge base: “…Amazon, a company, was founded in 1994.”

So, we can see that even though the synthetic string is not accurate and does not have “added knowledge”, the change in tone and style can be exploited. Could be a cool trick to try out :slight_smile:

Hope this was helpful


You seem to know a lot; respectfully, I want to ask you something. I want to fine-tune GPT-3.5 Turbo for code generation for a specific Python library called Manim. I want GPT-3.5 to be very well informed about this library, so that when I prompt it I get good-quality code without syntax errors. How do I do this?

I’m working on a similar task. I’ll build a RAG for the codebase of Neos CMS. It’s a huge task to transform a complete framework into a RAG. I’m thinking about transforming the whole file structure into a knowledge graph on Neo4j, so I have nodes for classes with their methods and dependencies. In parallel I’ll try to generate vectors for each node, so that I can use the graph and the vectors in combination. Additionally, I plan to add example code for components and packages.
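Roughly what I have in mind per node (just a sketch; the labels, relationship names, and embedding model are my own working assumptions):

```python
# Store class/method nodes plus an embedding per class node in Neo4j.
# Property and relationship names are my own conventions, not anything standard.
from neo4j import GraphDatabase
from openai import OpenAI

client = OpenAI()
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
    return resp.data[0].embedding

def add_class(session, name: str, source: str, methods: list[str]):
    # one node per class, with its source embedding stored as a property
    session.run(
        "MERGE (c:Class {name: $name}) SET c.embedding = $vec",
        name=name, vec=embed(source),
    )
    # one node per method, linked back to its class
    for m in methods:
        session.run(
            "MATCH (c:Class {name: $name}) "
            "MERGE (m:Method {name: $m})-[:BELONGS_TO]->(c)",
            name=name, m=m,
        )

with driver.session() as session:
    add_class(session, "ContentRepository", "class ContentRepository { ... }", ["findNodeByPath"])
```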

Nice Idea. :eyes:

Orca 2 is really good at decision making. Maybe it’s a good choice for the fine-tuned model and for generating the right queries for the RAG. In my case it could decide whether to use the graph or do a vector search. :thinking:

There is so much to learn… :smiley:

Funny you say this, I am working on AI + Manim now. In short, I would suggest that fine-tuning is not the best approach for Manim. You’re better off scraping the documentation and setting up a knowledge base. Start here: Reference Manual - Manim Community v0.18.0. You can scrape the content from all the hrefs on that page, and then think of a way to use gpt-3.5 to retrieve the correct context.

That’s the way I would approach it.
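Something like this to get started (a rough sketch; the index URL and the link filtering are assumptions that would need adjusting to the actual site structure):

```python
# Collect the links on the Manim reference-manual index page and pull the text of
# each linked page into (url, text) pairs to build a knowledge base from.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

INDEX = "https://docs.manim.community/en/stable/reference.html"  # assumed index URL

def page_text(url: str) -> str:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return soup.get_text(" ", strip=True)

index_soup = BeautifulSoup(requests.get(INDEX).text, "html.parser")
links = {urljoin(INDEX, a["href"]) for a in index_soup.find_all("a", href=True)
         if not a["href"].startswith("#")}

docs = [(url, page_text(url)) for url in sorted(links)]
```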

Hi Patrick, I scraped all the documentation for ManimCE into JSON. I can share it with you; would you like to meet and discuss? My email is ahmettungabayrak@gmail.com