Hm, a lot of back and forth between the author of the video and me here. I wish others had chimed in too, for a more unbiased perspective. But here is my summary of this thread and the video:
Q: Can GPT generate training data?
A: Yes, it’s called synthetic data. Think of it as a fancy “lorem ipsum” generator: GPT can create text snippets at scale according to your specs regarding topic, tone and sentiment. Works great for training text classifiers (which is what all the articles referenced above are talking about - these articles are not about a model generating its own training data).
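To make the “lorem ipsum at scale” idea concrete, here is a minimal sketch of how such a spec could be turned into a generation prompt. The helper name, the spec fields and the prompt wording are my own illustration, not from the video or the thread:

```python
# Sketch: assemble a prompt asking the model for synthetic, labeled
# text snippets to train a classifier on. All wording here is an
# illustrative assumption, not an established recipe.

def build_synthetic_prompt(topic: str, tone: str, sentiment: str, n: int = 5) -> str:
    """Build a prompt requesting n snippets matching the given specs."""
    return (
        f"Write {n} short customer messages about {topic}.\n"
        f"Tone: {tone}. Sentiment: {sentiment}.\n"
        "Return one message per line."
    )

prompt = build_synthetic_prompt("delivery delays", "informal", "negative")
print(prompt)
```

You would send one such prompt per (topic, tone, sentiment) combination and collect the completions as labeled training examples for the classifier.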
Q: Can GPT generate its own finetuning training data?
A: No. At least there is no evidence that I’ve seen. Said video does not prove it (see below).
Q: Is GPT3 hallucinating/confabulating?
A: Not unless you want it to. If you keep the temperature low and phrase your prompt accordingly, I have not seen it make things up. In this particular use case, it wouldn’t fabricate an item in a list of “medications given” if the medication wasn’t in the patient record.
Curiously, you can, however, make it fabricate things with prompts like this:
> What famous scientists were kittens?
Some famous scientists who were kittens include Albert Einstein, Marie Curie, and Isaac Newton.
> When was Isaac Newton a kitten?
Isaac Newton was a kitten in the 17th century.
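As a sketch of the “keep the temperature low and phrase your prompt accordingly” point: the request below pins temperature to 0 and gives the model an explicit way out (“None”) so it isn’t pushed into inventing items. The parameter names follow the completion-style API of that era; the model name, prompt wording and helper are illustrative assumptions:

```python
# Sketch of request parameters for a deterministic extraction call.
# Model name, prompt wording and the "None" escape hatch are my own
# illustration, not the setup used in the video.

def extraction_request(record: str) -> dict:
    return {
        "model": "text-davinci-003",   # assumed model choice
        "temperature": 0,              # low temperature: most likely tokens only
        "max_tokens": 200,
        "prompt": (
            "List the medications given in the patient record below.\n"
            "If no medications are mentioned, answer 'None'.\n\n"
            f"Record:\n{record}\n\nMedications given:"
        ),
    }

req = extraction_request("Patient received 500 mg amoxicillin TID.")
```

The “famous scientists were kittens” prompt does the opposite: it presupposes a false premise, and the model obliges by completing it.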
Q: Is it a good idea to synthesize training data on davinci and then use the output to finetune curie?
A: I don’t think so. Even though your ft-curie model might have a better understanding of the particular domain you are finetuning it for, it will still have a lesser general understanding of the world (idioms, synonyms, tone, etc.) and thus perform worse than if you had finetuned davinci directly. When you then try to get to the bottom of your quality issues, you will be at a loss as to whether it’s curie itself, your training data, your methodology, or something else entirely.
Q: Is it legit to test (or even spot check) your model on the same data that you used to fine tune it?
A: No, you have to use data that the model hasn’t seen before.
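The split can be as simple as shuffling the records once and holding out a fraction before finetuning; the held-out records are then used only for evaluation. A pure-Python sketch (the 80/20 ratio and the fixed seed are arbitrary choices):

```python
import random

# Sketch: hold out a test set the model never sees during finetuning.
def train_test_split(records: list, test_fraction: float = 0.2, seed: int = 42):
    """Shuffle once, then split into (train, test) with no overlap."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = [f"patient_record_{i}" for i in range(10)]
train, test = train_test_split(records)
assert not set(train) & set(test)   # no record appears in both sets
```

Finetune on `train`, evaluate (and spot check) only on `test`.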
Q: Confabulating, hallucinating, accuracy, precision, bias, variability, hyperplanes,… what are all these terms?
A: Not really sure how they got in here, as these are not terms used in the scientific literature about language models. The graphic posted above is about metrology. See Precision Vs. Accuracy – Information Technology
Q: Are accuracy and precision the same as bias and variability?
A: Yes, see Accuracy and precision - Wikipedia
Q: So what are the metrics that we should use here?
A: In a use case where a list of “medications prescribed” is to be extracted from an unstructured patient record, the metrics to use are “recall” and “precision”. See Precision and recall - Wikipedia
If your model fails to extract all the “medications prescribed” you’re dealing with poor recall.
If your model erroneously considers words as medications prescribed you are dealing with poor precision.
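For the medication-extraction case, both metrics fall out of comparing the extracted set against a hand-labeled gold set. A small sketch (the example medication lists are made up):

```python
# Sketch: precision and recall for set-valued extraction output.
def precision_recall(extracted: set, gold: set) -> tuple:
    """Precision: fraction of extracted items that are correct.
    Recall: fraction of gold items that were extracted."""
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

gold = {"amoxicillin", "ibuprofen", "metformin", "lisinopril"}
extracted = {"amoxicillin", "ibuprofen", "aspirin"}  # one spurious, two missed

p, r = precision_recall(extracted, gold)
# 2 of 3 extracted are correct -> precision 2/3; 2 of 4 gold found -> recall 1/2
```

Missed medications lower recall; spurious ones (like “aspirin” here) lower precision.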
David: I understand that you are trying to establish yourself as an “educator” here, heavily promoting your own videos. But I wish that you would be a little bit more scientific - especially if you are using phrases like “wearing my professor hat”, falsely giving the impression you hold such credentials. Here is what I observed:
You let GPT3 generate completions based on unstructured patient records and use these completions to finetune GPT3. Then you use the same patient record to spot check how it works. That’s data the model has already seen. Big no-no in machine learning.
The title of the video was “Finetune multiple cognitive tasks with GPT-3 on medical texts (and reduce hallucination)”, but neither did your finetuning work, nor did you show that davinci hallucinates while your finetuned model doesn’t.
In the end, the finetuned model delivers worse results than plain vanilla davinci, and you blame it on the fact that you finetuned curie. So what did you actually mean to demonstrate?
Then in your latest follow-up video you mix up the two questions “Can GPT generate its own finetuning training data?” and “Can GPT-3 generate training data?”. Don’t you see the difference?