Correcting wrong answers via fine-tuning

Hi :slight_smile: . For a few days I’ve been doing some fine-tuning of GPT-3.5. I am trying to teach it some new concepts about Machinations (an economy design tool). By default, GPT-3.5 gives rather unclear answers about how to do certain things in Machinations.

My worst “nightmare” is that the model thinks that Machinations has an Unreal Engine plugin. Machinations DOES have a plugin for Unity 3D, but not for Unreal Engine, and I’m trying to teach the model this basic fact, plus a bunch of other things where it gives unclear/vague answers (which is expected because Machinations is rather niche - although it DOES know about it and has some decent answers by default).

So, I generated a relatively good set of data for training it, based on the Machinations manual. The data is of course in JSONL. The largest set sits at 1533 examples, which I think is a fairly good chunk of data.

Unfortunately, even after 20 epochs of training and 2,487,300 tokens, the model still gets a very basic fact wrong, one that was actually repeated several times (with different formulations) in the training data.

I ask it if Machinations has an Unreal Engine plugin and the answer is almost always “yes”.

This is what my 20-epoch training run looks like. Maybe I’m missing something?

Then I even tried a whopping 32 epochs, this time with a smaller file of just 15 examples in which I basically repeat to it, like a mad parrot, that THERE IS NO UNREAL ENGINE PLUGIN.

Even after that, it keeps hallucinating that there IS one, and on top of that, after the 32-epoch training many of the other answers go haywire (worse hallucinations).

This is what the smaller-file, 32-epoch training looks like:

I would really appreciate some guidance here before I sink more money into this. Spent about $100 so far.

Fine-tuning is not the appropriate tool for this job.

Fine-tuning influences model behaviour, not model knowledge.

Presently, the only way to augment model knowledge is with some type of RAG implementation.

Beyond that, 32 epochs of fine-tuning training is a lot. I would be absolutely shocked if you haven’t overfit.

1 Like

You’ll want to use some kind of vector similarity index and prompt augmentation here.
Look up the typical RAG (Retrieval Augmented Generation) approach of using an embedding vector index to find “snippets” that match the question, and provide them in the prompt.
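
Roughly, the core retrieval idea in code (just a sketch, assuming the OpenAI Python SDK v1; the model name and the snippet contents are illustrative placeholders):

```python
# Embed the question, find the most similar snippet, and prepend it to the prompt.
# Assumes OPENAI_API_KEY is set; snippets here are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

snippets = [
    "Machinations offers a Unity 3D plugin for synchronizing diagram values.",
    "Machinations does not offer an Unreal Engine plugin.",
]

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

snippet_vecs = [embed(s) for s in snippets]

question = "Does Machinations have an Unreal Engine plugin?"
q_vec = embed(question)

# Cosine similarity against every snippet; pick the best match.
sims = [float(q_vec @ v / (np.linalg.norm(q_vec) * np.linalg.norm(v))) for v in snippet_vecs]
best = snippets[int(np.argmax(sims))]

prompt = f"Use the following documentation snippet to answer.\n\n{best}\n\nQuestion: {question}"
```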

2 Likes

@elmstedt I have a set of 1533 examples in the Question/Answer format that I used to fine-tune. Are you saying that these are completely unnecessary? I have seen some improvement in some areas; for example, the AI is including a few more links in its responses (links that I provided in the training data). Also, I’ve been looking at RAG, and one of the things listed as best practices for RAG is:

  • Fine-tune your LLM: Fine-tune your LLM on your specific domain to improve its performance.

What, really, is “performance” in this case? Is performance HOW the model answers, or WHAT it answers? I’m a bit confused and obviously a bit of a newbie :slight_smile:

@jwatte I’m curious if you know of any tools or libraries that I could use to implement RAG. I see there are some services online which offer some form of assistance with RAG, but this topic is a bit new to me.

Let’s say I have the manual of this Machinations tool (which I can convert to any format, such as JSON or custom-formatted training data). What do I do with it? Do I generate a database out of it, and then run some RAG/vector library that uses that database?

In that case, I suppose the flow would be something like this (a rough code sketch follows the list):

  1. user asks a question
  2. I use some library to extract context from my vector database
  3. prefix the user query with that context and only AFTER this, prompt the AI
  4. relay the answer back to the user
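
That flow maps almost one-to-one onto code. A rough sketch, assuming the OpenAI Python SDK v1 and a hypothetical `retrieve()` helper standing in for whatever vector library ends up doing step 2; the model name is illustrative:

```python
# Sketch of the four-step flow: retrieve context, build the prompt, call the model, relay the answer.
from openai import OpenAI

client = OpenAI()

def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Hypothetical helper: return the top_k most relevant manual fragments (step 2).
    In a real setup this queries your vector database."""
    return ["Machinations provides a Unity 3D plugin; there is no Unreal Engine plugin."][:top_k]

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))           # step 2: extract context
    response = client.chat.completions.create(          # step 3: prompt the AI
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided Machinations documentation."},
            {"role": "user", "content": f"Documentation:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content          # step 4: relay the answer
```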

Maybe? It depends on what you’re trying to accomplish by fine-tuning.

Some common use cases where fine-tuning can improve results:

  • Setting the style, tone, format, or other qualitative aspects
  • Improving reliability at producing a desired output
  • Correcting failures to follow complex prompts
  • Handling many edge cases in specific ways
  • Performing a new skill or task that’s hard to articulate in a prompt

https://platform.openai.com/docs/guides/fine-tuning/common-use-cases#:~:text=Some%20common%20use,in%20a%20prompt

Basically, you would fine-tune to change the model’s behaviour. This could be its tone, manner of answering, how it understands prompts, etc.

Fine-tuning cannot impart new knowledge unless you’re over-fitting and it’s just regurgitating the fine-tuning data.

1 Like

Hello @Kyliathy,

I had a similar problem where I needed very specific knowledge that the model lacked. My approach was to create a vector database with the data I needed it to know, and I modified my prompt to embed the most related data from the vector database.

1 Like

What vector database did you use? How did you decide how to segment the data? Are there any libraries out there you could recommend? Any articles about how to handle this “vectorization of knowledge”? I’m lacking in skill in this area :slight_smile:

People seem to like LangChain, and it has an example around that area.

The brief outline is (with a code sketch after the two lists):

build-time:

  1. get each document in text format
  2. chunk each document into fragments sized 100-500 tokens or so, with appropriate headers for each fragment
  3. calculate embedding vectors for each fragment
  4. store embedding+fragment-identifier in some vector database

inference-time:

  1. calculate embedding vector of question
  2. find closest N fragments in the vector database
  3. load text of those fragments back up from files or wherever (some vector dbs do this for you)
  4. construct a prompt that says approximately “given the following information: (document fragments here); answer the following question: (question here)”
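
A compact sketch of both phases, assuming the OpenAI Python SDK for embeddings and a plain in-memory list standing in for the vector database; the chunk size and model name are illustrative:

```python
# Build-time: chunk documents, embed each fragment, store (vector, fragment).
# Inference-time: embed the question, take the N closest fragments, build the prompt.
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # illustrative choice

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=text)
    return np.array(resp.data[0].embedding)

def chunk(document: str, size: int = 1200) -> list[str]:
    # Crude character-based chunking (~300 tokens); real pipelines split on headings.
    return [document[i:i + size] for i in range(0, len(document), size)]

index: list[tuple[np.ndarray, str]] = []   # (embedding, fragment) pairs

def build(documents: list[str]) -> None:
    for doc in documents:
        for fragment in chunk(doc):
            index.append((embed(fragment), fragment))

def retrieve(question: str, n: int = 3) -> list[str]:
    q = embed(question)
    scored = sorted(
        index,
        key=lambda item: -float(q @ item[0] / (np.linalg.norm(q) * np.linalg.norm(item[0]))),
    )
    return [fragment for _, fragment in scored[:n]]

def make_prompt(question: str) -> str:
    fragments = "\n\n".join(retrieve(question))
    return f"Given the following information:\n{fragments}\n\nAnswer the following question: {question}"
```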
1 Like

@Kyliathy I took a PDF and split it into separate articles, then used the OpenAI embeddings API to turn each article into a vector. I chose OpenAI, but honestly, there are other options out there. Some are more budget-friendly because you can run them yourself, and some might even be better than OpenAI in certain ways. I stored these vectors in the Milvus vector database. I picked it for its open-source advantage and great documentation, but there are other databases that could be just as good or better.

The tricky part was figuring out the right custom query for my prompts (since each prompt needed the embedding data, and those prompts differed a lot from one another) to get the best results from Milvus. After crafting and vectorizing the query, querying the Milvus database was pretty straightforward.
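
For reference, the insert/search round trip looks roughly like this with pymilvus’s `MilvusClient` (a sketch, not my actual code; the collection name, local Milvus Lite file, and embedding dimension of 1536 are assumptions):

```python
# Sketch: embed articles with OpenAI, store them in Milvus, then search by a query embedding.
from openai import OpenAI
from pymilvus import MilvusClient

oai = OpenAI()
milvus = MilvusClient("machinations.db")  # Milvus Lite: local, file-backed instance

milvus.create_collection(collection_name="articles", dimension=1536)

def embed(text: str) -> list[float]:
    return oai.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

articles = ["...article text 1...", "...article text 2..."]
milvus.insert(
    collection_name="articles",
    data=[{"id": i, "vector": embed(a), "text": a} for i, a in enumerate(articles)],
)

hits = milvus.search(
    collection_name="articles",
    data=[embed("Does Machinations have an Unreal Engine plugin?")],
    limit=3,
    output_fields=["text"],
)
context = "\n\n".join(hit["entity"]["text"] for hit in hits[0])
```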

When it comes to the initial research I did, I can’t exactly recall the specific articles I read. Usually, I start with a basic understanding of what I’m dealing with and why it’s useful, and then dive into the more detailed documentation. Maybe someone else here has some good article recommendations that explain things in a clear, straightforward way.

1 Like

Thank you @HenriqueMelo & jwatte

@elmstedt so how does OpenAI teach ChatGPT new things? Are they also using RAG behind the scenes? I suppose with the new Bing Search feature, that’s exactly what they’re doing? Pre-prompting it automatically?

But come on, the model DOES know a lot of things “by default”. Hmm, maybe in order to introduce new concepts one would have to retrain the ENTIRE MODEL? An operation that would cost tens of thousands of dollars just in compute costs.

Am I on track here? Just want to understand more :slight_smile:

I don’t have any behind-the-scenes information, but I would anticipate it’s just a continuation of the training of the full models.

You need to understand the differences in scope and scale at play here.

All GPT-3[1] models were trained on 300B tokens. That’s a lot of tokens, and GPT-3.5-Turbo and GPT-4 were presumably trained on quite a few more than that.

With that much training, the knowledge and abilities are “emergent.” There’s so much different training data present, about so many things, expressed in so many ways, that it is possible to impart new “knowledge” without overfitting.

I suppose, theoretically, if you had enough diverse training tokens about a specific subject—in your case the design tool Machinations—you could impart new knowledge without overfitting, but I’ve never seen it done successfully in practice.

I wouldn’t even want to speculate about how many training tokens you would need to accomplish this, but my gun-to-my-head answer is that on the order of 10 million to 500 million tokens’ worth of distinct, exceptional-quality training data, trained for 3-4 epochs, would be a reasonable starting point. That’s a very expensive experiment with no guarantee of success.

Even if it worked, you have to remember that even GPT-4 regularly falters on moderately complex Python tasks.

To your benefit, I’m guessing Machinations is a several-orders-of-magnitude smaller domain space than Python, so being narrower it may be easier to give deeper knowledge, IDK.

On the other hand, if you have high quality documentation data and good data in question/answer format such as a Machinations forum with solved problems, you could use that as a starting point for creating a vector database of embeddings for injecting into context with a hybrid-search approach.

In the end, I think we’d all need to know more details about what you’re trying to accomplish with your project. Are you trying to build a Machinations expert AI, essentially an unfailing, interactive, documentation interface? If so, I still think RAG is the way to go.

Regardless, I am interested to see how your project turns out.

Tangentially, I clearly know nothing about Machinations myself, but from briefly looking at it, it seems interesting. Does it offer a non-visual, programmatic interface? If so, one interesting thing you could look at implementing would be to connect a model to it.


  1. davinci, etc. ↩︎

1 Like

@elmstedt I already made up my mind to use RAG (and possibly other methods of providing context). I will go back into the lab, install qdrant, and use my existing code which invokes OpenAI to help me format data via the LLM itself (I already used Charlie, as I lovingly call ChatGPT, and its API to generate the JSONL training data; now I’ll use it for the vectorization as well, as @HenriqueMelo also did).
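
A minimal Qdrant sketch along those lines, assuming the qdrant-client package and OpenAI embeddings; the collection name, in-memory instance, and embedding model are placeholders:

```python
# Sketch: index manual chunks in Qdrant and pull back the closest matches for a question.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

oai = OpenAI()
qdrant = QdrantClient(":memory:")  # swap for QdrantClient(url="http://localhost:6333")

qdrant.create_collection(
    collection_name="machinations_manual",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def embed(text: str) -> list[float]:
    return oai.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

chunks = ["...manual chunk 1...", "...manual chunk 2..."]
qdrant.upsert(
    collection_name="machinations_manual",
    points=[PointStruct(id=i, vector=embed(c), payload={"text": c}) for i, c in enumerate(chunks)],
)

hits = qdrant.search(
    collection_name="machinations_manual",
    query_vector=embed("Does Machinations have an Unreal Engine plugin?"),
    limit=3,
)
context = "\n\n".join(hit.payload["text"] for hit in hits)
```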

As for Machinations, yes, it has an API, which I actually coded about 4 years ago. The problem is that the API is rather limited. It was originally built via socket.io and targeted at connecting Machinations with the Unity game engine. So the focus is purely on synchronizing values between diagrams and the engine. But we’re working on a REST API which will give full read/write access.

Nowadays we have an ML specialist working with us, and we indeed have several initiatives in this area. I’m also looking into diagram minification so that we can ask Charlie to describe diagrams. And you guessed it, one idea is to use function calling to allow Charlie to build diagrams from text (our CEO would probably faint if we actually pulled that off :smiley: and I think it’s not far off).
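
For the function-calling idea, the rough shape would be something like the sketch below; the tool name and its parameter schema are entirely hypothetical placeholders, not a real Machinations API:

```python
# Sketch of letting the model propose a diagram via function calling.
# "create_diagram" and its parameters are hypothetical, not an actual Machinations API.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "create_diagram",  # hypothetical
        "description": "Create a Machinations diagram from a list of nodes and connections.",
        "parameters": {
            "type": "object",
            "properties": {
                "nodes": {"type": "array", "items": {"type": "string"}},
                "connections": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["nodes"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Build a simple gold-mining economy diagram."}],
    tools=tools,
)
tool_calls = response.choices[0].message.tool_calls  # each call carries its arguments as a JSON string
```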

2 Likes