Over-prompting with irrelevant context

We have built a Slack chatbot that answers customer questions about our company, products, pricing, and various technical details, injecting context from embeddings built from our web content and our long support history in Slack.
At the moment we use the GPT-4 8K model and allocate 4K tokens to each request, which we split further: 2K for web content and 2K for Slack history as context.

Our web documentation tends to have quite long pages that are currently chunked to ~1K tokens, and I feel that having full pages in context would help generate more precise answers.

Now, considering that a 32K model will be available one day, I am wondering whether it is worth filling that token space with context, how to balance out irrelevant context, and how not to waste tokens when context is not applicable at all.

Since context is selected based on cosine similarity with the question, the most relevant content should be at the top, with relevancy degrading further down.
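For reference, the selection step today looks roughly like this (a simplified sketch, not the actual bot code; the chunk structure, token counts, and budget value are placeholders):

```python
# Simplified sketch of the current selection: rank pre-computed chunk
# embeddings by cosine similarity to the question and pack chunks into the
# fixed context budget in that order. No relevance cut-off is applied yet,
# which is exactly the problem in question 1 below.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_context(question_emb, chunks, max_tokens=2000):
    """chunks: list of dicts like {"text": str, "embedding": np.ndarray, "tokens": int}."""
    ranked = sorted(chunks,
                    key=lambda c: cosine_sim(question_emb, c["embedding"]),
                    reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if used + chunk["tokens"] > max_tokens:
            break
        selected.append(chunk["text"])
        used += chunk["tokens"]
    return selected
```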

So my questions are:

  1. How do we draw the line so that only truly relevant context is included?

  2. Does over-prompting with context degrade the quality of the answers, or is only the relevant part used to generate them?

  3. How do we NOT “consume” all the allocated tokens for questions like “How are you”?

Please share your experiences and thoughts.

One method I use is to feed the returned top-K embeddings into a prompt for GPT with the query appended: “Of these returned vector embeddings, which are the top {X} that would best support this question: {user_query}?”

There are a couple of downsides: you usually can’t get super long replies, so you have to limit your new context to 3 or 4 chunks (depending on embedding chunk size), and there is the additional token cost. But what you gain is a concise main prompt and usually some great answers.

You can do that alongside a standard “all the embeddings” style query, and then run a third query asking the AI to pick the best reply of the two and use that as the final answer.
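Roughly, the re-ranking pass looks like this (a sketch using the OpenAI Python client; the model name, prompt wording, and helper names are illustrative placeholders, not a fixed recipe):

```python
# Sketch of the re-ranking pass: hand the top-K retrieved chunks back to the
# model and ask it to pick the few that best support the question. The
# returned numbers are then parsed and used to build a much smaller context.
from openai import OpenAI

client = OpenAI()

def rerank_chunks(user_query: str, top_k_chunks: list[str], keep: int = 3) -> str:
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(top_k_chunks, 1))
    prompt = (
        f"{numbered}\n\n"
        f"Of these returned passages, which are the top {keep} that would best "
        f"support this question: {user_query}\n"
        "Reply with the passage numbers only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content  # e.g. "2, 5, 7" -- parse, then reuse those chunks
```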


Thank you for sharing your ideas!

I’m aware of top-K, but the issue is that large documents end up very fragmented into small chunks, and a bunch of small parts does not provide the perspective of the whole document needed to generate a correct answer. So I’m filling in as much context as possible, assuming and hoping that more is better than too little. But is that true? And what about cases where the very first top-K result is the relevant context and would be sufficient on its own: does the rest (the clutter) somehow affect the quality of the response?

This is exactly why I then run the returned embeddings through the model again: not to ask questions about the embeddings, but rather to ask the model which of those embeddings it considers the best and most relevant given the original question, and to pick its top few.

You can then use those as the new context, knowing that the AI has pre-screened them for relevancy, not just with an ada-002 model but with a much more powerful one.


This approach sounds interesting. Thank you! Will give it a try.

Could you please share more details on “run the returned embeddings through the model again”?

Sure. When you query your embeddings, you first vectorise your query text, then see which of your embedded data is “similar”, and you get back the closest matches. Now take those closest matches (the top 3, say), put them in a prompt, and append some instructions like “Given this as context, please answer this user question: {user_question}”, where the user question is the embedding query you first ran.
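In code, the whole loop is something like this (a sketch with the OpenAI Python client; `vector_store.closest(...)` is a stand-in for whatever similarity search you use, not a real library call):

```python
# End-to-end sketch: embed the query, look up the closest stored chunks, then
# prompt the model with those chunks as context plus the original question.
from openai import OpenAI

client = OpenAI()

def answer(user_question: str, vector_store, top_n: int = 3) -> str:
    q_emb = client.embeddings.create(
        model="text-embedding-ada-002",
        input=[user_question],
    ).data[0].embedding

    matches = vector_store.closest(q_emb, top_n)   # top-N most similar chunks
    context = "\n\n".join(matches)

    prompt = (
        f"{context}\n\n"
        f"Given this as context, please answer this user question: {user_question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```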


Possibly try, in your pre-prompt script, prefacing it with “keep your reply as short as possible” or something like that.
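For example, something along these lines as the system message (just an illustration of where the instruction goes; the wording is a placeholder):

```python
# Sketch: prepend a brevity instruction as the system message so small-talk
# questions don't pull a large context block into the answer.
user_question = "How are you?"

messages = [
    {"role": "system",
     "content": "Keep your reply as short as possible. "
                "Only use the provided context if it is relevant to the question."},
    {"role": "user", "content": user_question},
]
```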
