Fine-tuning GPT-3 on entire conversations to mimic style and extract relevant knowledge

I have a dataset of conversations between a chatbot with specific domain knowledge and a user. These conversations have the following format:

Chatbot: Message or answer from chatbot
User: Message or question from user
Chatbot: Message or answer from chatbot
User: Message or question from user
… etc.

There are a number of these conversations, and the idea is that we want GPT-3 to pick up both the style of conveying this information and the knowledge in the responses (i.e., respond to questions posed by the User similarly to how the chatbot does in the dataset).

I tried the following when preparing and creating the JSONL file: the prompts are empty and each completion contains the entire back-and-forth conversation. I noticed that this mimicked the “style” of the chatbot when generating new output, but not the actual information from the conversations in the dataset (it generated made-up answers and went off track right from the beginning).

I am not sure how to go about structuring the dataset for the JSONL file for fine-tuning. It seems that the way I structured it only helps the model understand the style of the chatbot and not the domain knowledge it needs to answer the questions appropriately, without veering off track.
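For reference, here is roughly how I built the JSONL records (a minimal sketch; the conversation text below is a made-up placeholder, not from my actual dataset):

```python
import json

# Placeholder conversation; my real data follows the Chatbot:/User: format above.
conversation = (
    "Chatbot: Hello, how can I help?\n"
    "User: What topics can you cover?\n"
    "Chatbot: I can answer questions about our domain.\n"
)

# Empty prompt, entire back-and-forth as the completion (one JSONL line per conversation).
record = {"prompt": "", "completion": " " + conversation}
line = json.dumps(record)
print(line)
```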

Any help in the matter is much appreciated!
Thanks in advance :slight_smile:


How much data do you have?

Also, I might format the data differently, where the prompt is the block of conversation leading up to the last post from the user and the completion is the very last response by the bot. Seems to be that you want it to learn to generate output for the bot only.
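A rough sketch of that reformatting (the conversation here is made up, and the separators would need adjusting to your actual data):

```python
import json

# Hypothetical conversation turns; each tuple is (speaker, text).
turns = [
    ("Chatbot", "Hello! Ask me anything about our domain."),
    ("User", "When is the clinic open?"),
    ("Chatbot", "The clinic is open weekdays from 9 to 5."),
]

# Prompt = the conversation leading up to the bot's final reply, ending with
# the "Chatbot:" cue; completion = only that final reply.
*context, (last_speaker, last_text) = turns
prompt = "\n".join(f"{s}: {t}" for s, t in context) + "\nChatbot:"
completion = " " + last_text

record = json.dumps({"prompt": prompt, "completion": completion})
print(record)
```

You can emit one such record per bot turn (sliding the cut point forward) to get several training examples out of each conversation.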


IMO don’t try to have the bot remember factual information. Instead teach it to parse full sentences into useful queries of an existing knowledge base. So basically a thin wrapper to make things seem more natural.

Question: So i've heard about this cool starship thing & i'm wondering when it's gonna launch?
Parsed: When will the next starship launch?

And then we have some simple regex monitoring the message feed to filter out the internal messages not meant to be shown to the user. In this example we would have code that submits the parsed question as a Google search and returns the first featured snippet result, but of course you need a fallback if the result does not make sense.

Returned: In July 2020, SpaceX anticipated a cargo Starship mission to Mars as early as 2022, followed by a crewed Starship mission to Mars in 2024. As of 16 October 2020, the cargo flight will happen in 2024/5 and the crewed flight in 2026/7.
Summarized: As early as 2022.

Where Summarized is what you might actually show to the user after filtering out the internal query.
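A minimal sketch of that filter (the line tags and regex are assumptions based on the example above; adapt them to however your feed is actually formatted):

```python
import re

# Internal lines are tagged "Parsed:" or "Returned:"; only "Summarized:" lines
# (with the tag stripped) should ever reach the user.
feed = [
    "Question: So i've heard about this cool starship thing & i'm wondering when it's gonna launch?",
    "Parsed: When will the next starship launch?",
    "Returned: In July 2020, SpaceX anticipated a cargo Starship mission to Mars as early as 2022...",
    "Summarized: As early as 2022.",
]

visible = [
    re.sub(r"^Summarized:\s*", "", line)
    for line in feed
    if line.startswith("Summarized:")
]
print(visible)  # only the user-facing answer survives the filter
```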

There are obvious issues with making sure your search actually returns a useful answer as its first result, but you could maybe train the model to retry with multiple potential queries until it gets a result that seems correct.


That’s an interesting take on it. It seems I might have to do that, since after a couple of fine-tuning tries the model only picks up the style.

How would you go about achieving these two subgoals? That is, how do I go from one subgoal (mimicking style) to also extracting knowledge from a specific domain?

Not a lot of data. There are around 12 back-and-forth conversations that were manually created to address several of the concerns we wanted the chatbot to talk about when prompted by the user.

Thank you for your suggestion. I tried your recommendation as a fine-tuning job, and it basically learned to say the last line (the completion line) in many different ways. However, it does seem to have some of that knowledge attached to the responses: when I prompted it about specific things in the training dataset, it kind of seemed to know what I was talking about (though not to the fullest extent).

Yes, the output is for the bot only. Do you have any other suggestions I might try when fine-tuning it? I also ended up applying to fine-tune the davinci model (currently in private beta), as we’re trying to make it highly conversational while sticking to a given informational domain. Fingers crossed we get into the beta and can try it out.

This is a really cool approach. I would just be worried about the accuracy of the “googled” information since we’re dealing with sensitive data and addressing widespread misinformation through this chatbot.

However, your approach made me think (as did @m-a.schenk reply) that maybe this needs to be a multi-step process (probably 2 steps at least). I could potentially have it parse the question from the user into a knowledge-base query as you suggested and then instead of a google search, extract that information from a saved knowledge-base with domain information.

Oh yeah obviously don’t actually use google. You may want to use multiple layers of separately finetuned prompts, each highly specialized. For example:

  • “Is this message a question about one of the topics [Topic 1], [Topic 2], or [Topic 3]?”
  • “Reword this question as a concise search query.”
  • “Answer Yes or No: Does the above information contain a useful answer to the following question?”
  • “Using the above information, answer the following question:”
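As a sketch, those layers could be chained like this. Everything here is illustrative: the `complete` function is a stub standing in for calls to your fine-tuned models, and its canned answers exist only so the example runs.

```python
TOPICS = ["Topic 1", "Topic 2", "Topic 3"]

def complete(prompt):
    """Stub for a fine-tuned model call; returns canned answers for this demo."""
    if prompt.startswith("Is this message a question"):
        return "Yes"
    if prompt.startswith("Reword this question"):
        return "when is starship launching"
    if prompt.startswith("Answer Yes or No"):
        return "Yes"
    return "As early as 2022."

def answer(message, knowledge_lookup):
    # Layer 1: is this an on-topic question at all?
    on_topic = complete(
        f"Is this message a question about one of the topics {', '.join(TOPICS)}?\n{message}"
    )
    if on_topic != "Yes":
        return None
    # Layer 2: normalize the message into a concise query.
    query = complete(f"Reword this question as a concise search query.\n{message}")
    info = knowledge_lookup(query)
    # Layer 3: sanity-check that the retrieved info actually answers the question.
    useful = complete(
        f"Answer Yes or No: Does the above information contain a useful answer "
        f"to the following question?\n{info}\n{message}"
    )
    if useful != "Yes":
        return None  # fallback path: retry with another query, or punt
    # Layer 4: produce the final user-facing answer.
    return complete(f"Using the above information, answer the following question:\n{info}\n{message}")

print(answer("when's that starship thing launching?",
             lambda q: "Cargo flight planned for 2024/5."))
```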

This is your biggest problem. You will need to find more chat corpora or generate synthetic data. I would not pursue those other methods (searching the internet / searching a DB) because (1) they have already been done and (2) they are gimmicks and shortcuts. Facebook did it with Blender Bot: Blender Bot 2.0: An open source chatbot that builds long-term memory and searches the internet

You’d get far better results by using GPT-3 to its fullest extent.

Now, as for ideas:

  1. Synthetic data. As you can see, GPT-3 is great at simulating conversations. If you do have a DB or KB, you can use it to create synthetic data: imagined conversations. Thousands of them. This approach will be 100x easier than method 2.
  2. You could split up the process but please do not use regular semantic search or internet search. You could use the Answers endpoint for GPT-3. This, as you will quickly discover, adds several orders of magnitude more complexity. (Trust me, I built a cognitive architecture like this)
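A sketch of method 1: turn each KB fact into a prompt that asks GPT-3 to imagine a conversation grounded in that fact. The facts and prompt wording below are made up for illustration.

```python
import json

# Hypothetical knowledge-base facts pulled from your domain DB/KB.
facts = [
    "The clinic is open weekdays from 9am to 5pm.",
    "Appointments can be booked online or by phone.",
]

def synthesis_prompt(fact):
    """Build a prompt asking GPT-3 to imagine a chatbot/user conversation about a fact."""
    return (
        "The following is a conversation between a friendly, factual chatbot and a user.\n"
        f"The chatbot knows that: {fact}\n\n"
        "Chatbot:"
    )

# Each generated conversation would then be split into prompt/completion pairs
# (as discussed above) and written out as JSONL for fine-tuning.
for fact in facts:
    print(json.dumps({"prompt": synthesis_prompt(fact)}))
```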

Now, if you can’t get #1 to work, and must go with #2, here’s what you’ll need to make sure it’s stable and reliable:

  1. A fine-tuned dataset for intent/question extraction. This isn’t so hard. Just find some question/answer datasets on the internet and reverse them, or something like that. You can also synthesize this dataset from your corpus of domain text. Basically, you want a dataset such that the model can read any chat log and infer the user’s intent. The hard part is then matching that intent to unstructured knowledge or information.
  2. That hard part is where you’ll want a second fine-tuned model that specializes in matching messy/noisy queries to higher-quality information. This, from my work, is a non-trivial problem. For instance, I’ve tried using GPT-3 to match a word description to words (reverse dictionary). It sucks at this kind of thing. The problem is that the problem space is so big: you could be asking for any number of things, and this gets even worse when you have silly end users asking questions. (Tangentially, this is why I recommended sticking with one model and synthetic data trained on your domain information.) Again, you could use the Answers endpoint here, but that will just give you the answer, and you want something conversational.
  3. Finally, you’ll need a last model that integrates the original chat log and the answers found and generates new output. You can do this with a simple prompt (I had great success with this in my experiments). For some reason, GPT-3 seems to be able to integrate seemingly unrelated information into a conversation. I have quite a few examples in my book.
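Wired together, the three stages might look like the sketch below. All three `*_model` functions are stubs standing in for separately fine-tuned models (or, for stage 3, a simple prompt); the knowledge base and chat log are made up.

```python
def intent_model(chat_log):
    """Stage 1 stub: read the chat log and infer the user's intent."""
    return "user wants the launch date"

def match_model(intent, knowledge_base):
    """Stage 2 stub: match a noisy intent to the best piece of stored information.
    Here, crude word overlap; in practice this is the hard, fine-tuned part."""
    return max(knowledge_base,
               key=lambda doc: len(set(intent.split()) & set(doc.split())))

def integrate_model(chat_log, found_answer):
    """Stage 3 stub: fold the found answer back into the conversation."""
    return f"{chat_log}\nChatbot: {found_answer}"

kb = ["The launch date is currently targeted for 2024.",
      "Tickets are not on sale."]
log = "User: when do we launch?"
reply = integrate_model(log, match_model(intent_model(log), kb))
print(reply)
```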

But essentially, you can achieve your goals through a single fine-tuned model or through a cognitive architecture - guess which one is simpler? :stuck_out_tongue:

My book, if you’re curious. I’ve got a ton of example prompts in the back that may be helpful to you.


I’m not sure if you realize, but this is an argument against your point, not for it.
For real-world applications you want to build something tried and tested. Original research is for OpenAI, not random businesses trying to use their product.

Nah. The industry leaders all do research, as do every startup who wants to do something disruptive. If you just stick with “tried and true” there’s no market share to take.

JSONL, etc… had me lost in the cloud for a week🤯 I’ve just started to recover. At the end of this lived experience, I found that the Google Cloud model is racist. I brought it to their attention and no one really seemed to care.

AI is changing at such an incredible rate, my previous work seems obsolete. I understand why research is about 5 years behind and why I must part from the Google Cloud :call_me_hand:


Racism is what happens when you train on general public data. I know OpenAI is working to bake in safety and remove bias, but that could be an intractable problem. Imperialism and colonialism is baked into the entire English language as well as a huge chunk of its corpus of data. Much of the bias is implicit, though if you read certain materials, it is also explicit. In my experiments, GPT-3 is equally capable of taking on any racist/bigoted/intolerant position. You can have GPT-3 create a man-hating misandrist who wants to castrate all men. Hell, you can even invent cultures/religions and GPT-3 can neural-transfer bigotry into those imaginary spaces.

When you achieve mastery of language, that means you can do anything with it, good or bad. Very often, in order to defeat something, you must understand it. And when you understand it, you can recreate it, even if your goal is to defeat it.

Imagine this scenario: You want to model radicalized teenagers on the internet. So you finetune a model to do just that. You then set it up as a GAN to create a second model that is an expert in deradicalizing said internet teens. In the course of creating a deradicalization bot, you have also created an expertly racist and sexist bot. Two sides of the same coin.

Fun fact: I tested racism/bigotry against an earlier version of my Raven project, and it succeeded in gently pushing the conversation away from explicit content, without a GAN.


You might want to (a) add a cadence metric via concurrent side-channels; (b) an onomatopoeic cacophony dictionary parser, since the task of learning is delegated to the neurons with softmax cohomologies. Then, for example: “Pf, I’m pf…Donald Pff Duck”, “Guantanamera, Guajira, Guantanamera” or “Hope Springs …” for the suffixed grapheme; and perhaps (c) the ontology of some output pairing of HyperLogLog prints of some byte string with context-free operators, which pushes out tongue-twisters or Bloom-filtered monotonic morphemes with modality.