Fine-tune GPT-3 model to work as a chatbot that answers questions from a knowledge context

Hey there!

I’m working on a chatbot project. I want the chatbot to answer visitors’ questions using only the documentation provided in the prompt:

You are customer support service AI Chatbot.
You only provide answers that can be found in documentation.
If you don't know the answer or don't have enough information say "Sorry I don't know"
documentation: ${documentation}
question: ${question}

This works as is. However, I’m running into a problem: the documentation string gets very big when I stitch together multiple documents for the chatbot to know.

I’m curious about ways to sort of “preload” documents into the model and update the prompt to not include them at all.

Any suggestions?



One idea is to shorten the documentation string by picking sections of the full documentation which are likely related to the question. Feed the most likely section (or sections if they’re all small) as ${documentation}. A baseline using the Embeddings endpoint:

  1. [preprocessing] split up the whole documentation by paragraph or section or something
  2. [preprocessing] encode each section as a vector embedding, and store them so that you don’t need to recompute them
  3. encode the question
  4. set ${documentation} to be the section of the documentation w/ the lowest cosine distance to the encoded question.
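
Here is a minimal sketch of those four steps in Python. Big assumption: `toy_embed` is a made-up bag-of-words stand-in so the example runs offline. In practice you would swap in a call to a real embeddings endpoint and cache the vectors. The function names and the blank-line splitting rule are also just illustrative choices, not anything prescribed by the API.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a real embedding call: word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity for two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def build_index(full_docs, embed=toy_embed):
    """Steps 1 and 2: split on blank lines, embed each section once."""
    sections = [s.strip() for s in re.split(r"\n\s*\n", full_docs) if s.strip()]
    return [(section, embed(section)) for section in sections]


def most_relevant_section(question, index, embed=toy_embed):
    """Steps 3 and 4: embed the question, return the closest section."""
    q_vec = embed(question)
    return min(index, key=lambda item: cosine_distance(q_vec, item[1]))[0]


docs = (
    "To reset your password, open Settings and click Reset password.\n\n"
    "Refunds are issued within 14 days of purchase.\n\n"
    "Our office is open Monday to Friday."
)
index = build_index(docs)  # compute once, store, and reuse
documentation = most_relevant_section("How do I reset my password?", index)
```

The returned `documentation` string is what you would substitute for ${documentation} in the prompt, so the prompt only carries one section instead of everything.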

You could use heuristics to figure out which section to select instead of using the closest one in embedding space. Perhaps the question mentions certain keywords.
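
As a rough illustration of the keyword idea, here is one possible heuristic: count how many non-trivial question words appear in each section and pick the section with the most overlap. The stopword list, the function names, and the "return None if nothing matches" rule are all my own assumptions for the sketch. There is no stemming, so "refund" will not match "refunds".

```python
import re

# Hypothetical, deliberately tiny stopword list; extend it for real use.
STOPWORDS = {"a", "an", "the", "is", "are", "do", "i", "my", "to", "of",
             "how", "what", "when", "can", "you", "your", "in", "on"}


def keywords(text):
    """Lowercased word set with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def keyword_score(question, section):
    """Number of question keywords that also occur in the section."""
    return len(keywords(question) & keywords(section))


def pick_by_keywords(question, sections):
    """Return the best-overlapping section, or None if nothing overlaps."""
    best = max(sections, key=lambda s: keyword_score(question, s))
    return best if keyword_score(question, best) > 0 else None


sections = [
    "To reset your password, open Settings and click Reset password.",
    "Refunds are issued within 14 days of purchase.",
]
picked = pick_by_keywords("When are refunds issued?", sections)
```

Returning None when no keyword matches gives you a natural place to fall back to the embedding lookup, or to the "Sorry I don't know" reply.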

A far more work- and data-intensive solution is to train a model which predicts the section which a question is most related to. You’d need labeled examples of (feature=question, label=section of documentation). This is a text classification task. To get good performance, you’d likely need 100s of examples (regardless of the modeling approach), and the documentation would have to be somewhat fixed (which it usually ain’t).
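
To make the classification framing concrete, here is a toy sketch: a tiny naive Bayes classifier in plain Python that maps a question to a section label. This is just one simple modeling choice I picked for illustration, not a recommendation. The class name, the labels, and the four training examples are made up, and four examples is nowhere near the 100s you would want.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class SectionClassifier:
    """Toy multinomial naive Bayes with add-one smoothing."""

    def fit(self, questions, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> number of examples
        self.vocab = set()
        for question, label in zip(questions, labels):
            tokens = tokenize(question)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, question):
        tokens = tokenize(question)
        total = sum(self.label_counts.values())

        def log_prob(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total)  # prior
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)  # smoothed likelihood
            return score

        return max(self.label_counts, key=log_prob)


# Hypothetical labeled data: (feature=question, label=section of documentation)
train_questions = [
    "How do I reset my password?",
    "I forgot my password",
    "Can I get a refund?",
    "How long do refunds take?",
]
train_labels = ["account", "account", "billing", "billing"]

clf = SectionClassifier().fit(train_questions, train_labels)
predicted = clf.predict("I need to reset my password")
```

The predicted label would then be used to look up the matching section text and drop it into the prompt. Note that every time the documentation is reorganized, the labels and the model need to be redone, which is the "somewhat fixed" caveat above.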


This was very helpful! I used embeddings to find the most relevant documentation, which avoids using a huge prompt. Thanks!


Glad to help!

I might look into the embeddings endpoint myself. But I’m not sure how their new embeddings models were trained for similarity tasks, and how performant they are. Overall, did you find that the cosine distance b/t a question embedding and a documentation embedding was correlated w/ how similar they really were?

This is a pretty awesome answer, cheers! Will try this out soon. I had not considered predicting the section vs. using simple cosine distance.