Can GPT generate its own finetuning training data?

I ran into this video the other day: Finetune multiple cognitive tasks with GPT-3 on medical texts (and reduce hallucination) - YouTube
His goal is to ask GPT-3 questions about a particular medical record without it confabulating.
In short, his approach is this:

  1. He has the medical records of 200 patients as text files.
  2. He found prompts that can summarize a list of medications, as well as a prognosis/diagnosis, for a particular record. Pretty cool.
  3. He applies these prompts to every patient record and sends them to text-davinci-002. The responses are saved in one JSONL file.
  4. He feeds this JSONL to the finetuning API to create a new model.
  5. The expectation is now that the new model is giving better answers to the same prompts than the plain text-davinci-002.
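As I understand it, steps 3-4 amount to something like the following sketch. This is my own minimal illustration, not code from the video: `summarize_stub` stands in for the actual text-davinci-002 call, and the prompt wording is a guess.

```python
import json

def summarize_stub(record_text):
    # Stand-in for the real call to text-davinci-002 with the
    # summarization prompt; it would return the model's completion.
    return "Medications: aspirin. Prognosis: stable."

def build_finetune_jsonl(records):
    """Build the JSONL the finetuning API expects: one
    {"prompt": ..., "completion": ...} object per line."""
    lines = []
    for record in records:
        prompt = f"Summarize the medications and prognosis:\n{record}\n\n###\n\n"
        # Leading space on the completion plus an explicit stop sequence,
        # as the finetuning guidelines recommend.
        completion = " " + summarize_stub(record) + " END"
        lines.append(json.dumps({"prompt": prompt, "completion": completion}))
    return "\n".join(lines)
```

The resulting file would then be uploaded to the finetuning endpoint (step 4).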

I can’t understand why the finetuned model would be better than text-davinci-002. It seems unintuitive that, without adding any information (e.g. manually adjusting the output of step 3 for accuracy), the model would get better.

Any thoughts?


I explained the reasoning in the video, another thread, and numerous other videos. You might benefit from reading my books and watching more of my finetuning videos. There are several reasons why GPT-3 can generate its own data and achieve better results than just prompt engineering.

  1. You can filter out aberrations. One of the chief purposes of finetuning is getting consistent results. Better training data == better results. If you watch the video more carefully, as well as my other videos, I explain this when I delete aberrational outputs.
  2. You can incorporate multiple prompts (as demonstrated in the video) which effectively consolidates multiple steps into a single task, thus saving token costs while achieving better results (see previous point). This reduces the chances of aberrations developing during “prompt chaining.”
  3. You can incorporate dozens (or hundreds) of different prompts and datasources in your finetuning model, which creates a positive value-add beyond just generating outputs from individual prompts. To put that another way, by combining real-world data with synthetic data from GPT-3, you can train a finetuned model that is far more robust.
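Point 1 - filtering out aberrations - could also be done mechanically. Here is a minimal sketch of the idea (my own illustration with crude heuristics, not the by-hand deletion shown in the video):

```python
import json

def filter_training_rows(jsonl_text, min_length=10):
    """Drop obviously aberrational rows from a synthetic training file:
    empty or very short completions, and exact duplicates.
    Returns the rows worth keeping."""
    kept, seen = [], set()
    for line in jsonl_text.splitlines():
        row = json.loads(line)
        completion = row["completion"].strip()
        if len(completion) < min_length or completion in seen:
            continue  # aberrational or redundant - leave it out
        seen.add(completion)
        kept.append(row)
    return kept
```

Real aberration filtering would of course use task-specific checks (e.g. every listed medication must actually appear in the source record), but the principle is the same: curate before you finetune.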

The key difference here is switching from a quantitative mindset to a qualitative mindset.


Thanks David,
Still don’t get it. The three points you are making sound more like a reason for why finetuning is useful (which I don’t doubt).
Would you mind summarizing here why and how a model can get more accurate by feeding it its own output as training data (without manually adjusting every single record of the output for accuracy)? It’s totally counterintuitive, isn’t it? I can’t see where you delete aberrational outputs in said video, so forgive my disbelief.

Also, what’s ironic is that plain davinci seems to confabulate less than your trained model. Check out completion_1653053632.928051.json and compare it with where you test your model with the same input.


I never made that claim. I said it would become more consistent. Accuracy is a measurement of proximity to the truth. Consistency is a measurement of resilience in the face of variation.

By training it the way I did, the model can handle widely inconsistent input. Sure, this one demo video may not thoroughly demonstrate that individual concept, which is why I recommended you watch more of my videos. This video was mostly to prove that you can consolidate two steps into one, which was the nature of the original challenge I was given.


David, I understood that what you are trying to do in the video is make GPT confabulate and hallucinate less (your words). Confabulating and hallucinating sound like not being close enough to the truth - thus asking it for a list of medications or a given prognosis becomes inaccurate. Hence less confabulating and hallucinating means closer proximity to the truth (thus more accurate), doesn’t it?

Making a model better by feeding it what it already knows sounds counterintuitive and too good to be true, which is why I’d love to understand your rationale. Kindly explain why a model would be less confabulating/hallucinating by finetuning it with its own output. Or point me to a particular video that explains? Much appreciated.


Okay, I think I see the problem. Let me put on my professor hat and see if we can help you understand.

I see your confusion. I am inferring here that you’re coming from a conventional ML background, yes? So I’m assuming you’re familiar with things like SVM and KNN. For the sake of this discussion, I will use the term accuracy to mean “proximity to the truth” and precision to mean “deviation in performance.”

Reducing confabulation is more about increasing precision than accuracy. Consult the below image for reference:

In this case, the entire purpose of finetuning to reduce confabulation/hallucination is to get the cluster tighter (right side of the diagram). I was not concerned about accuracy (proximity to the truth) as that was not the challenge I was given. The challenge I was given was to prevent GPT-3 from making stuff up entirely. My goal was to demonstrate the lower-right quadrant (and if it reached the upper-right quadrant, great, but that was not my goal).

With LLMs, there is an entirely new dimension to precision - confabulation would be outside of the problem space identified by the circle zone in the above diagram. Or it might be the wrong color, or shape. Hallucination and confabulation in GPT-3 mean that the output is in no way connected to the input - which is a result that is simply not possible with strictly mathematical models like SVM and KNN. Imagine a 3D problem space bisected by a 2D hyperplane in an SVM, and yet it spits out a 4D result - that is what hallucination is like in GPT-3. Or rather, it spits out a polynomial value when you’re expecting a vector. That’s what I mean by hallucination.

As to the poor performance in the video, there are two reasons: (1) it was trained on CURIE and (2) it did not have many samples. Honestly, I was impressed with how well it did on CURIE. If I were to double or triple the sample size and add a good STOP token, it would perform as well as DAVINCI.

So this indicates a third dimension to finetuning that you missed - using DAVINCI to generate training data to finetune CURIE. This is another huge advantage that can get CURIE finetunes to outperform other CURIE models. For instance, go try the initial prompts on a base CURIE model and see how poorly it performs compared to a finetuned CURIE.

UPDATE: here’s a modified graphic that demonstrates hallucination in the context of precision vs accuracy

(In actuality, hallucination/confabulation is often negative on both accuracy and precision because it is so disconnected with the actual input)


Thanks for the elaborate explanation David. This sounds like bias versus variability, but I’m still at a loss whether it’s legit to have GPT generate its own finetuning training data.

Sorry I missed that, but what was the task you were given? Was that one of the “Send me your GPT problems” tasks (I just can’t find the original task, which might help me understand)?
Given that the example was about asking tl;dr questions of a medical record, I was assuming that proximity to the truth is what we are looking for - unless this was built for Theranos :slight_smile:

Finetuning a cheap model with data that an expensive model generated sounds in fact interesting if you want to teach the cheap model a particular domain I figure. I’ll experiment with that for sure.


Not quite. In my opinion, confabulation and hallucination represent an entirely new phenomenon where machine learning is concerned. When you talk about bias you’re talking about erroneous assumptions baked into the training data - of which I’m sure there’s plenty in GPT-3. But there’s not likely any given bias towards any particular medication or treatment beyond how common some are (for instance, I bet GPT-3 training data had more instances of aspirin than Interferon gamma-1b).

Variability is just another word for saying “low precision” - but that is still not a fully accurate way to describe confabulation and hallucination. I would recommend that you induct confabulation into your lexicon when discussing LLM and transformers. The definition of confabulation, in this context, is to fill in gaps in memory by fabrication. The key word here is fabrication i.e. to make something up from nothing.

I should note that confabulation is a neuroscience term as well, showing that we are seeing convergence between artificial and organic neurology. Here’s a great article discussing confabulation as it pertains to psychiatry: Confabulation: A Bridge Between Neurology and Psychiatry?

This methodology is called “synthetic data” and fortunately GPT-3 has already been proved to be effective here: GPT3 Synthetic Data — AmeliorMate

And another: Synthetic Data Is About To Transform Artificial Intelligence

One from NVIDIA:

And finally Towards Data Science: The Magic of Synthetic Data. Using Artificial Intelligence to Train… | by Dewayne Whitfield | Towards Data Science

I do apologize, I had assumed you were up to speed on these things, but I realize that perhaps not everyone spends as much time in this field as I do. Generating synthetic data is a 100% legit practice, and it comes with its own protocols as do all methods in science.

Still, performance depends on the task. In this case, I used a hybrid approach (generating synthetic data output but with real-world data as a partial input) so I do need to push back against the way you’re characterizing this - it is not 100% GPT-3 generating its own training data. The bulk of the input is wild data. If it were 100% GPT-3, as with many of my chatbots, then GPT-3 would be generating both the input and the output data.

To put it another way, only the output side of this model is synthetic, but the input is real.

The specific challenge came from a phone call. They were trying to incorporate GPT-3 into a care app that was supposed to (among other things) extract and summarize important information from numerous sources. The explanation they gave was that people with large care teams (multiple doctors/specialists/friends/family) have a lot of information and you can’t necessarily rely on one person to have all the information. So the example they gave was to ensure that the app or service always has the correct medications and treatments, given a large amount of unstructured text and changing logs. For instance, a medical chart may say “Discontinue medication X and try medication Y” - you would need something as flexible as GPT-3 to automatically parse that without a human.

However, GPT-3 tends to hallucinate and if you ask it to list a medication, well, it will just start listing medications whether or not they appear in the text.

In some cases, you can get DAVINCI-level performance from a properly finetuned CURIE model.


Hm, a lot of back and forth between the author of the video and me here. And I wish others would have chimed in too for a more unbiased perspective. But here is my summary of this thread and the video:

Q: Can GPT generate training data?
A: Yes, it’s called synthetic data. Think of it as a fancy “lorem ipsum” generator. GPT can, at scale, create text snippets according to your specs regarding topic, tone and sentiment. Works great for training text classifiers (which is what all the articles referenced above are talking about - these articles are not about a model generating its own training data).

Q: Can GPT generate its own finetuning training data?
A: No. At least there is no evidence that I’ve seen. Said video does not prove it (see below).

Q: Is GPT3 hallucinating/confabulating?
A: Not unless you want it to. If you keep the temperature low and phrase your prompt accordingly, I have not seen it making things up. In this particular use case, it wouldn’t fabricate an item in a list of “medications given” if the medication wasn’t in the patient record.
Curiously, you can however make it make things up with prompts like this:

> What famous scientists were kittens?
Some famous scientists who were kittens include Albert Einstein, Marie Curie, and Isaac Newton.
> When was Isaac Newton a kitten?
Isaac Newton was a kitten in the 17th century.

Q: Is it a good idea to synthesize training data on davinci and then use the output to finetune curie?
A: I don’t think so. Even though your ft-curie model might have a better understanding of the particular domain you are finetuning it to, it will still have a lesser general understanding of the world (idioms, synonyms, tone, etc.) and thus perform worse than if you had finetuned davinci directly. When you are trying to get to the bottom of your quality issues, you will be at a loss whether it’s just curie, your training data, your methodology or who knows what.

Q: Is it legit to test (or even spot check) your model on the same data that you used to fine tune it?
A: No, you have to use data that the model hasn’t seen before.
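A minimal holdout split, just to make the point concrete (a hypothetical helper, not from the video):

```python
import random

def split_holdout(rows, test_fraction=0.2, seed=0):
    """Set aside a fraction of records BEFORE finetuning, so the model
    can later be evaluated on data it has never seen."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]  # (train, held-out test)
```

With 200 patient records, that would mean finetuning on ~160 and spot-checking on the ~40 the model never saw.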

Q: Confabulating, hallucinating, accuracy, precision, bias, variability, hyperplanes,… what are all these terms?
A: Not really sure how they got in here as these are not used in scientific literature about language models. The graphic posted above is about metrology. See Precision Vs. Accuracy – Information Technology

Q: Is accuracy and precision the same as bias and variability?
A: Yes, see Accuracy and precision - Wikipedia

Q: So what are the metrics that we should use here?
A: In a use case where a list of prescribed medications shall be extracted from an unstructured patient record, we should use “recall” and “precision”. See Precision and recall - Wikipedia
If your model fails to extract all the “medications prescribed” you’re dealing with poor recall.
If your model erroneously considers words as medications prescribed you are dealing with poor precision.
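For the medication-extraction task, both metrics reduce to simple set arithmetic (a sketch; the medication names in the usage note are illustrative):

```python
def precision_recall(predicted, actual):
    """Precision: share of predicted medications that are really in the
    record. Recall: share of the record's medications that were found."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

E.g. if the record lists aspirin and interferon but the model returns aspirin plus a made-up drug, both precision and recall are 0.5 - the made-up drug is the confabulation showing up as poor precision, and the missed interferon is poor recall.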

David: I understand that you are trying to establish yourself as an “educator” here, heavily promoting your own videos. But I wish that you would be a little bit more scientific - especially if you are using phrases like “wearing my professor hat”, falsely giving the impression you hold such credentials. Here is what I observed:

You let GPT-3 generate completions based on unstructured patient records and use these completions to finetune GPT-3. Then you use the same patient record to spot check how it works. That’s data the model has already seen. Big no-no in machine learning.

The title of the video was “Finetune multiple cognitive tasks with GPT-3 on medical texts (and reduce hallucination)” but neither did your finetuning work, nor did you show that davinci hallucinates where your finetuned model doesn’t.

In the end, the finetuned model delivers worse results than plain vanilla davinci, and you blame it on the fact that you finetuned curie. So what did you actually mean to demonstrate?

Then in your latest follow up video you mix up the two questions “Can GPT generate its own finetuning training data?” and “Can GPT-3 generate training data?”. Don’t you see the difference?


I’m not sure what I’ve done to offend you, I’m honestly trying to share my knowledge and my experience, but you’re becoming increasingly demanding and aggressive. If you’d like to continue this discussion in a more civil tone, I’d request that you soften your tone and ask questions about what you don’t understand.


@daveshapautomator You didn’t offend me at all, it’s just that I’m frustrated about a video which sounds interesting (if not too good to be true) at first, but whose content turns out to be flawed on closer inspection (see the last 4 paragraphs in my previous post). But instead of “point taken - let me post an update” or something, you are posting even more fancy-sounding, partially almost comical misinformation, yet avoiding answers.

Or take this thread: Can you stop it from making up sources?
I don’t know why, but your elaborate answer about “cognitively incomplete systems” has nothing to do with the question asked; instead you are promoting a video of yours. I don’t think that’s right.

Respectfully, I believe there are issues with your videos and your posts. I posted constructive criticism, and I did in fact ask questions in the last two paragraphs of my previous post. Yours to answer if you wish.
