Correcting wrong answers via fine-tuning

Hi :slight_smile: . For a few days I’ve been doing some fine-tuning of GPT-3.5. I am trying to teach it some new concepts about Machinations (an economy design tool). By default, GPT-3.5 gives rather unclear answers about how to do certain things in Machinations.

My worst “nightmare” is that the model thinks that Machinations has an Unreal Engine plugin. Machinations DOES have a plugin for Unity 3D, but not for Unreal Engine, and I’m trying to teach the model this basic fact, plus a bunch of other things where it gives unclear/vague answers (which is expected because Machinations is rather niche - although it DOES know about it and has some decent answers by default).

So, I generated a relatively good set of data for training it, based on the Machinations manual. The data is of course in JSONL. The largest set sits at 1533 examples, which I think is a fairly good chunk of data.

Unfortunately, even after 20 epochs of training and 2,487,300 tokens, the model still gets a very basic fact wrong, one that was actually repeated several times (with different formulations) in the training data.

I ask it if Machinations has an Unreal Engine plugin and the answer is almost always “yes”.

This is what my 20-epoch training run looks like. Maybe I’m missing something?

Then I even tried a whopping 32 epochs, this time with a smaller file of just 15 examples in which I basically repeat to it, like a mad parrot, that THERE IS NO UNREAL ENGINE PLUGIN.

Even after that, it keeps hallucinating that there IS one, and on top of that, after the 32-epoch training many of the other answers go haywire (worse hallucinations).

This is what the smaller-file, 32-epoch training looks like:

I would really appreciate some guidance here before I sink more money into this. Spent about $100 so far.

Fine-tuning is not the appropriate tool for this job.

Fine-tuning influences model behaviour, not model knowledge.

Presently, the only way to augment model knowledge is with some type of RAG implementation.

Beyond that, 32 epochs of fine-tuning training is a lot. I would be absolutely shocked if you haven’t overfit.

1 Like

You’ll want to use some kind of vector similarity index and prompt augmentation here.
Look up the typical RAG (Retrieval Augmented Generation) approach of using an embedding vector index to find “snippets” that match the question, and provide them in the prompt.
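
Roughly, the core retrieval idea in code (just a sketch, assuming the OpenAI Python SDK v1; the model name and the snippet contents are illustrative placeholders):

```python
# Embed the question, find the most similar snippet, and prepend it to the prompt.
# Assumes OPENAI_API_KEY is set; snippets here are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

snippets = [
    "Machinations offers a Unity 3D plugin for synchronizing diagram values.",
    "Machinations does not offer an Unreal Engine plugin.",
]

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

snippet_vecs = [embed(s) for s in snippets]

question = "Does Machinations have an Unreal Engine plugin?"
q_vec = embed(question)

# Cosine similarity against every snippet; pick the best match.
sims = [float(q_vec @ v / (np.linalg.norm(q_vec) * np.linalg.norm(v))) for v in snippet_vecs]
best = snippets[int(np.argmax(sims))]

prompt = f"Use the following documentation snippet to answer.\n\n{best}\n\nQuestion: {question}"
```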

2 Likes

@elmstedt I have a set of 1533 examples in the Question/Answer format that I used to fine-tune. Are you saying that these are completely unnecessary? I have seen some improvement in some areas; for example, the AI is including a few more links in its responses (links that I provided in the training data). Also, I’ve been looking at RAG, and one of the things listed as best practices for RAG is:

  • Fine-tune your LLM: Fine-tune your LLM on your specific domain to improve its performance.

What, really, is “performance” in this case? Is performance HOW the model answers, or WHAT it answers? I’m a bit confused and obviously a bit of a newbie :slight_smile:

@jwatte I’m curious if you know of any tools or libraries that I could use to implement RAG. I see there are some services online which offer some form of assistance with RAG, but this topic is a bit new to me.

Let’s say I have the manual of this Machinations tool (which I can convert to any format, such as JSON or custom-formatted training data). What do I do with it? Do I generate a database out of it, and then run some RAG/vector library that uses that database?

In that case, I suppose the flow would be something like this (a rough code sketch follows the list):

  1. user asks a question
  2. I use some library to extract context from my vector database
  3. prefix the user query with that context and only AFTER this, prompt the AI
  4. relay the answer back to the user
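
That flow maps almost one-to-one onto code. A rough sketch, assuming the OpenAI Python SDK v1 and a hypothetical `retrieve()` helper standing in for whatever vector library ends up doing step 2; the model name is illustrative:

```python
# Sketch of the four-step flow: retrieve context, build the prompt, call the model, relay the answer.
from openai import OpenAI

client = OpenAI()

def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Hypothetical helper: return the top_k most relevant manual fragments (step 2).
    In a real setup this queries your vector database."""
    return ["Machinations provides a Unity 3D plugin; there is no Unreal Engine plugin."][:top_k]

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))           # step 2: extract context
    response = client.chat.completions.create(          # step 3: prompt the AI
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only the provided Machinations documentation."},
            {"role": "user", "content": f"Documentation:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content          # step 4: relay the answer
```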

Maybe? It depends on what you’re trying to accomplish by fine-tuning.

Some common use cases where fine-tuning can improve results:

  • Setting the style, tone, format, or other qualitative aspects
  • Improving reliability at producing a desired output
  • Correcting failures to follow complex prompts
  • Handling many edge cases in specific ways
  • Performing a new skill or task that’s hard to articulate in a prompt

https://platform.openai.com/docs/guides/fine-tuning/common-use-cases#:~:text=Some%20common%20use,in%20a%20prompt

Basically, you would fine-tune to change the model’s behaviour. This could be its tone, manner of answering, how it understands prompts, etc.

Fine-tuning cannot impart new knowledge unless you’re over-fitting and it’s just regurgitating the fine-tuning data.

1 Like

Hello @Kyliathy,

I had a similar problem where I needed very specific knowledge that the model lacked. My approach was to create a vector database with the data I needed it to know, and I modified my prompt to embed the most related data from the vector database.

1 Like

What vector database did you use? How did you decide how to segment the data? Are there any libraries out there you could recommend? Any articles about how to handle this “vectorization of knowledge”? I’m lacking in skill in this area :slight_smile:

People seem to like LangChain, and it has an example around that area.

The brief outline is (with a code sketch after the two lists):

build-time:

  1. get each document in text format
  2. chunk each document into fragments sized 100-500 tokens or so, with appropriate headers for each fragment
  3. calculate embedding vectors for each fragment
  4. store embedding+fragment-identifier in some vector database

inference-time:

  1. calculate embedding vector of question
  2. find closest N fragments in the vector database
  3. load text of those fragments back up from files or wherever (some vector dbs do this for you)
  4. construct a prompt that says approximately “given the following information: (document fragments here); answer the following question: (question here)”
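
A compact sketch of both phases, assuming the OpenAI Python SDK for embeddings and a plain in-memory list standing in for the vector database; the chunk size and model name are illustrative:

```python
# Build-time: chunk documents, embed each fragment, store (vector, fragment).
# Inference-time: embed the question, take the N closest fragments, build the prompt.
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # illustrative choice

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=text)
    return np.array(resp.data[0].embedding)

def chunk(document: str, size: int = 1200) -> list[str]:
    # Crude character-based chunking (~300 tokens); real pipelines split on headings.
    return [document[i:i + size] for i in range(0, len(document), size)]

index: list[tuple[np.ndarray, str]] = []   # (embedding, fragment) pairs

def build(documents: list[str]) -> None:
    for doc in documents:
        for fragment in chunk(doc):
            index.append((embed(fragment), fragment))

def retrieve(question: str, n: int = 3) -> list[str]:
    q = embed(question)
    scored = sorted(
        index,
        key=lambda item: -float(q @ item[0] / (np.linalg.norm(q) * np.linalg.norm(item[0]))),
    )
    return [fragment for _, fragment in scored[:n]]

def make_prompt(question: str) -> str:
    fragments = "\n\n".join(retrieve(question))
    return f"Given the following information:\n{fragments}\n\nAnswer the following question: {question}"
```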
1 Like

@Kyliathy I took a PDF and split it into separate articles, then used the OpenAI embeddings API to turn each article into a vector. I chose OpenAI, but honestly, there are other options out there. Some are more budget-friendly because you can run them yourself, and some might even be better than OpenAI in certain ways. I stored these vectors in the Milvus vector database. I picked it for its open-source advantage and great documentation, but there are other databases that could be just as good or better.

The tricky part was figuring out the right custom query for my prompts (since each prompt needed the embedding data, and those prompts differed a lot from one another) to get the best results from Milvus. After crafting and vectorizing the query, querying the Milvus database was pretty straightforward.
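
For reference, the insert/search round trip looks roughly like this with pymilvus’s `MilvusClient` (a sketch, not my actual code; the collection name, local Milvus Lite file, and embedding dimension of 1536 are assumptions):

```python
# Sketch: embed articles with OpenAI, store them in Milvus, then search by a query embedding.
from openai import OpenAI
from pymilvus import MilvusClient

oai = OpenAI()
milvus = MilvusClient("machinations.db")  # Milvus Lite: local, file-backed instance

milvus.create_collection(collection_name="articles", dimension=1536)

def embed(text: str) -> list[float]:
    return oai.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

articles = ["...article text 1...", "...article text 2..."]
milvus.insert(
    collection_name="articles",
    data=[{"id": i, "vector": embed(a), "text": a} for i, a in enumerate(articles)],
)

hits = milvus.search(
    collection_name="articles",
    data=[embed("Does Machinations have an Unreal Engine plugin?")],
    limit=3,
    output_fields=["text"],
)
context = "\n\n".join(hit["entity"]["text"] for hit in hits[0])
```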

When it comes to the initial research I did, I can’t exactly recall the specific articles I read. Usually, I start with a basic understanding of what I’m dealing with and why it’s useful, and then dive into the more detailed documentation. Maybe someone else here has some good article recommendations that explain things in a clear, straightforward way.

1 Like

Thank you @HenriqueMelo & jwatte

@elmstedt so how does OpenAI teach ChatGPT new things? Are they also using RAG behind the scenes? I suppose with the new Bing Search feature, that’s exactly what they’re doing? Pre-prompting it automatically?

But come on, the model DOES know a lot of things “by default”. Hmm, maybe in order to introduce new concepts one would have to retrain the ENTIRE MODEL? An operation that would cost tens of thousands of dollars just in compute costs.

Am I on track here? Just want to understand more :slight_smile:

I don’t have any behind-the-scenes information, but I would anticipate it’s just a continuation of the training of the full models.

You need to understand the differences in scope and scale at play here.

All GPT-3[1] models were trained on 300B tokens. That’s a lot of tokens, and GPT-3.5-Turbo and GPT-4 were presumably trained on quite a few more than that.

With that much training, the knowledge and abilities are “emergent.” There’s so much different training data present, about so many things, expressed in so many ways, that it is possible to impart new “knowledge” without overfitting.

I suppose, theoretically, if you had enough diverse training tokens about a specific subject—in your case the design tool Machinations—you could impart new knowledge without overfitting, but I’ve never seen it done successfully in practice.

I wouldn’t even want to speculate about how many training tokens you would need to accomplish this, but my gun-to-my-head answer is that on the order of 10 million to 500 million tokens’ worth of distinct, exceptional-quality training data, trained for 3-4 epochs, would be a reasonable starting point. That’s a very expensive experiment with no guarantee of success.

Even if it worked, you have to remember that even GPT-4 regularly falters on moderately complex Python tasks.

To your benefit, I’m guessing Machinations is a several-orders-of-magnitude smaller domain space than Python, so being narrower it may be easier to give deeper knowledge, IDK.

On the other hand, if you have high quality documentation data and good data in question/answer format such as a Machinations forum with solved problems, you could use that as a starting point for creating a vector database of embeddings for injecting into context with a hybrid-search approach.

In the end, I think we’d all need to know more details about what you’re trying to accomplish with your project. Are you trying to build a Machinations expert AI, essentially an unfailing, interactive, documentation interface? If so, I still think RAG is the way to go.

Regardless, I am interested to see how your project turns out.

Tangentially, I clearly know nothing about Machinations myself, but from briefly looking at it, it seems interesting. Does it offer a non-visual, programmatic interface? If so, one interesting thing you could look at implementing would be to connect a model to it.


  1. davinci, etc. ↩︎

1 Like

@elmstedt I already made up my mind to use RAG (and possibly other methods of providing context). I will go back into the lab, install qdrant, and use my existing code which invokes OpenAI to help me format data via the LLM itself (I already used Charlie, as I lovingly call ChatGPT, and its API to generate the JSONL training data; now I’ll use it for the vectorization as well, as @HenriqueMelo also did).
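
A minimal Qdrant sketch along those lines, assuming the qdrant-client package and OpenAI embeddings; the collection name, in-memory instance, and embedding model are placeholders:

```python
# Sketch: index manual chunks in Qdrant and pull back the closest matches for a question.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

oai = OpenAI()
qdrant = QdrantClient(":memory:")  # swap for QdrantClient(url="http://localhost:6333")

qdrant.create_collection(
    collection_name="machinations_manual",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def embed(text: str) -> list[float]:
    return oai.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

chunks = ["...manual chunk 1...", "...manual chunk 2..."]
qdrant.upsert(
    collection_name="machinations_manual",
    points=[PointStruct(id=i, vector=embed(c), payload={"text": c}) for i, c in enumerate(chunks)],
)

hits = qdrant.search(
    collection_name="machinations_manual",
    query_vector=embed("Does Machinations have an Unreal Engine plugin?"),
    limit=3,
)
context = "\n\n".join(hit.payload["text"] for hit in hits)
```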

As for Machinations, yes, it has an API, which I actually coded about 4 years ago. The problem is that the API is rather limited. It was originally built via socket.io and targeted at connecting Machinations with the Unity game engine. So the focus is purely on synchronizing values between diagrams and the engine. But we’re working on a REST API which will give full read/write access.

Nowadays we have an ML specialist working with us, and we indeed have several initiatives in this area. I’m also looking into diagram minification so that we can ask Charlie to describe diagrams. And you guessed it, one idea is to use function calling to allow Charlie to build diagrams from text (our CEO would probably faint if we actually pulled that off :smiley: and I think it’s not far off).
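
For the function-calling idea, the rough shape would be something like the sketch below; the tool name and its parameter schema are entirely hypothetical placeholders, not a real Machinations API:

```python
# Sketch of letting the model propose a diagram via function calling.
# "create_diagram" and its parameters are hypothetical, not an actual Machinations API.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "create_diagram",  # hypothetical
        "description": "Create a Machinations diagram from a list of nodes and connections.",
        "parameters": {
            "type": "object",
            "properties": {
                "nodes": {"type": "array", "items": {"type": "string"}},
                "connections": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["nodes"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Build a simple gold-mining economy diagram."}],
    tools=tools,
)
tool_calls = response.choices[0].message.tool_calls  # each call carries its arguments as a JSON string
```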

2 Likes