Best method of injecting a relatively large amount of context to be leveraged in a response

I’ve seen multiple related posts about this but no clear recommendations.

If organizations are looking to inject a relatively large amount of context, say 4000+ tokens of something like a list of products to recommend from, or relevant documentation to provide or reference, what is the best method for having the model consider this context during generation, and then include it in the response where it deems it relevant?

I have seen embeddings and semantic search suggested in some posts, with mixed success being cited, and I’ve also seen users attempting fine-tuning, again with mixed success.
We have also considered building a retrieval plugin, but I don’t believe it is possible to interact with plugins via the API yet, so this would limit the implementation to the ChatGPT plugin interface only.

This is clearly a common use case. Apart from providing the full context prior to every prompt (which would likely become prohibitively expensive), what would the best options be to achieve this context injection?


At this time, embeddings are still the best answer.

Even with increasing context sizes (token limits) performance will be improved by narrowing the scope of the context included in the prompt.

I’d avoid any fine-tuning discussions until after you make something work with embeddings for a number of reasons. Otherwise you’ll find yourself in a classic over-engineering situation, one which is likely to perform worse even if you could make it work (at least as far as utilizing external/proprietary context in a conversational UX is concerned).

And while it may seem like you need >4K/8K tokens, in most cases this isn’t true. That’s where embeddings can help most. Essentially, “smart chunking” will allow you to cherry-pick the context used in your prompt based on the user query.
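
To make the “smart chunking” idea concrete, here is a rough sketch, not a definitive implementation. It packs paragraphs into chunks and keeps only the best-matching chunks that fit a word budget. The names `chunk_text`, `score` and `select_context` are made up for illustration, word counts stand in for real token counts, and the word-overlap score is only a placeholder for embedding similarity.

```python
import re

def chunk_text(text, max_words=50):
    """Split on blank lines, then pack paragraphs into chunks of at most max_words words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def score(query, chunk):
    """Placeholder relevance: fraction of query words found in the chunk.
    Assumption: a real system would use cosine similarity of embeddings here."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q) if q else 0.0

def select_context(query, chunks, budget_words=100):
    """Greedily keep the highest-scoring chunks that still fit the word budget."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    picked, used = [], 0
    for ch in ranked:
        n = len(ch.split())
        if used + n <= budget_words:
            picked.append(ch)
            used += n
    return picked
```

Only the chunks returned by `select_context` would go into the prompt, so the prompt size is bounded by the budget rather than by the size of the source material.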



General background comment from someone who built and sold a product recommendation SaaS company back in the 2000s, and later performed due diligence on several others.

‘Product Attributes’ make a very poor basis for recommendations, if you are looking for recommendations people might actually want to buy.
I’m also skeptical of embeddings for this application. If you are going to try them, I would certainly at least include sales data and customer attributes in the embedding. Also helpful to include any product description text and review text.
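
As a rough illustration of what “including sales data and customer attributes in the embedding” could look like, the sketch below flattens one product record into a single text string that would then be sent to whatever embedding model you use. The field names (`units_sold_90d`, `buyer_segments`, `reviews`) and the function `build_embedding_text` are hypothetical, not taken from any particular catalogue schema.

```python
def build_embedding_text(product, max_reviews=3):
    """Flatten a product record into one text string to embed.
    Assumption: `product` is a dict; only `name` is required."""
    parts = [f"Name: {product['name']}"]
    if product.get("description"):
        parts.append(f"Description: {product['description']}")
    if product.get("units_sold_90d") is not None:
        parts.append(f"Units sold (last 90 days): {product['units_sold_90d']}")
    if product.get("buyer_segments"):
        parts.append("Typical buyers: " + ", ".join(product["buyer_segments"]))
    # Cap the number of reviews so one heavily reviewed product does not dominate.
    for review in product.get("reviews", [])[:max_reviews]:
        parts.append(f"Review: {review}")
    return "\n".join(parts)
```

Whether numeric sales figures actually help the embedding is something you would have to measure on your own data.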

Welcome to the community @mathewkw

@wfhbrian is right. Embeddings are the way.

I’m curious if you’ve tried it.

If you haven’t, what’s stopping you? Is there a technical roadblock etc.?

Great thoughts and assumptions.

You can accomplish similar results to the retrieval plugin using GPT to form GraphQL queries. That combined with embeddings is… beautiful. Or just use a qna module.

As mentioned, though, embeddings are definitely the way to start, and may be all that you need.


Thanks for all the great feedback! One issue for us with embeddings, as bruce.dambrosio pointed out for the example of product recommendations, is that a fair amount of relevant detail needs to be included in the content for the embeddings to produce highly relevant recommendations from natural language text and queries. We’re also considering using GPT to systematically generate stronger product descriptors, engineered specifically to result in highly relevant embeddings.

I’m also trying to think of a way to leverage the GPT models’ own embeddings and deep knowledge of virtually all widely available products, and simply pass a “filter” of sorts, stating something like “here is a list of the products we offer (assume say 5000+ product names), provide users relevant product recommendations, but only from this subset of our products”

This results in extremely accurate product recommendations, which are generated from the deep knowledge and context in these models’ training data. The only issues are the max token limit and the cost of passing a large token context each time in a prompt or across multiple prompts.
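
One hedged way to get a similar effect without sending the whole list in the prompt is a post-filter: let the model suggest products freely, then keep only the suggestions that match your catalogue. Note this is a different technique from the in-prompt filter described above, and `normalize` and `filter_to_catalog` are made-up names for illustration. It only catches exact matches after case and whitespace normalisation.

```python
def normalize(name):
    """Lower-case and collapse whitespace so near-identical names compare equal."""
    return " ".join(name.lower().split())

def filter_to_catalog(candidates, catalog):
    """Keep model-suggested names that match a catalogue entry.
    Returns canonical catalogue names, in the order suggested, without duplicates."""
    canonical = {normalize(n): n for n in catalog}
    seen, kept = set(), []
    for c in candidates:
        key = normalize(c)
        if key in canonical and key not in seen:
            seen.add(key)
            kept.append(canonical[key])
    return kept
```

The catalogue lookup is a plain dictionary, so a list of 5000+ product names costs no tokens at all. The trade-off is that the model may suggest products you do not carry, and those are simply dropped.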

Does anyone know of a method to inject large context without having to pass it in each prompt chain? Something similar perhaps to the “memory” function seen in the retrieval plugin use case, or potentially achieved through fine-tuning to teach the model only to recommend products on the specific list?

You can keep a contextual stack in your database to understand “it” or “they” without relying on summarization. For recommendations you would need to use a separate AI. Weaviate has a module which accomplishes this without a separate database. I am still experimenting and learning with it, though.

There are graph databases as well and their embeddings are very powerful for finding relationships between products and users. If anyone else could chime in, I would appreciate it.

What you’re thinking of using is a classifier. But GPT 3.5 is close to the same price as Ada and imo is just as good. Also, you’d have to think about what happens if there’s a change in a product, or the attributes. It doesn’t work. Have you noticed that GPT can have multiple variations of “truth”, based on the time it was truthful?

Google has some great documentation on building a recommendation AI.

Honestly, these are all great thoughts, but I think you’d be best starting with a wonderful database like Weaviate and using a single form of embeddings, and then start building up. For documentation, Pinecone is nice and clean. Maybe it’s just me, but Weaviate’s is all over the place.

Also, I think there’s too much reliance on GPT. It’s not necessary, or efficient to send it large chunks of data for it to select and choose. As you have already theorized, the token amount can grow quite big quickly once comparisons begin - and they will. Recommendation engines can perform this already without wasting tokens on GPT.

Something else you could try is summarisation. Either in addition to vector search, as others have mentioned, or instead of, if the content is only a little over the context window limit.

There are different methodologies for summarisation, but one is to break your content into chunks and then progressively refine the answer using each chunk of content in turn. Appropriately called, “refine”.

The idea is that you iterate through chunks of content, passing in the answer from the previous iteration and prompting the model to update the answer if the current chunk of text included relevant information.
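
A bare-bones sketch of that loop, assuming `llm` is any function you supply that takes a prompt string and returns the model’s text (it is a placeholder, not a real library call). The prompt wording is just an example.

```python
def refine_answer(question, chunks, llm):
    """Iterate through chunks, asking the model to update the running answer each time."""
    answer = ""
    for chunk in chunks:
        prompt = (
            f"Question: {question}\n"
            f"Existing answer: {answer or '(none yet)'}\n"
            f"New context: {chunk}\n"
            "Update the existing answer only if the new context is relevant; "
            "otherwise repeat it unchanged."
        )
        # Each call depends on the previous answer, so this runs sequentially.
        answer = llm(prompt)
    return answer
```

This makes one model call per chunk, and the calls cannot be parallelised because each one needs the previous answer.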

Another method is map reduce, where you pass each chunk of content and a prompt to the model to get a preliminary answer, and then pass all those answers into the model to get the final response. Map reduce is faster since the per-chunk calls can run in parallel, but “refine” generally gives better results.
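
A comparable sketch for map reduce, again assuming `llm` is a function you supply that maps a prompt string to the model’s text. The `NONE` convention for irrelevant chunks and the thread pool size are my own assumptions, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor

def map_reduce_answer(question, chunks, llm, max_workers=4):
    """Get a preliminary answer per chunk in parallel, then combine them in one final call."""
    def map_step(chunk):
        return llm(
            f"Question: {question}\n"
            f"Context: {chunk}\n"
            "Answer using only this context, or reply NONE if it is not relevant."
        )

    # Map: the per-chunk calls are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_step, chunks))

    # Reduce: drop empty results and ask the model to merge the rest.
    useful = [p for p in partials if p.strip() != "NONE"]
    combined = "\n".join(f"- {p}" for p in useful)
    return llm(
        f"Question: {question}\n"
        f"Partial answers:\n{combined}\n"
        "Combine these into one final answer."
    )
```

Threads are used here because the work is mostly waiting on network calls. The final call still has to fit all the partial answers into one prompt, so very large inputs may need more than one reduce step.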

You could implement it yourself or Langchain includes methods for doing both refine and map reduce.

Thanks all for the feedback and alternative options. In case anyone is interested, here is another great method I found for what I was looking to accomplish, provided by OpenAI in their cookbook repo, the “Search & Ask” approach:

openai-cookbook/Question_answering_using_embeddings.ipynb at main · openai/openai-cookbook · GitHub

  • Takes a user query
  • Searches for text relevant to the query
  • Stuffs that text into a message for GPT
  • Sends the message to GPT
  • Returns GPT’s answer

Full procedure

Specifically, this notebook demonstrates the following procedure:

  1. Prepare search data (once per document)
       1. Collect: We’ll download a few hundred Wikipedia articles about the 2022 Olympics
       2. Chunk: Documents are split into short, mostly self-contained sections to be embedded
       3. Embed: Each section is embedded with the OpenAI API
       4. Store: Embeddings are saved (for large datasets, use a vector database)
  2. Search (once per query)
       1. Given a user question, generate an embedding for the query from the OpenAI API
       2. Using the embeddings, rank the text sections by relevance to the query
  3. Ask (once per query)
       1. Insert the question and the most relevant sections into a message to GPT
       2. Return GPT’s answer
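
For anyone who wants the shape of it without opening the notebook, here is a toy sketch of the search and ask steps. This is not the cookbook’s code: `toy_embed` is a bag-of-words stand-in for a real embeddings API call, `ask` is a placeholder for whatever function sends a message to GPT, and all the function names are my own.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for an embeddings API call: a sparse bag-of-words vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(sections, embed=toy_embed):
    """Prepare search data once per document: embed each section and store it."""
    return [(s, embed(s)) for s in sections]

def search(query, index, embed=toy_embed, top_k=2):
    """Embed the query and rank the stored sections by similarity to it."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [s for s, _ in ranked[:top_k]]

def build_message(query, sections):
    """Insert the question and the most relevant sections into one message."""
    context = "\n\n".join(sections)
    return (
        "Use the sections below to answer the question.\n\n"
        f"Sections:\n{context}\n\nQuestion: {query}"
    )

def search_and_ask(query, index, ask, top_k=2):
    """Search, build the message, and return whatever `ask` returns for it."""
    return ask(build_message(query, search(query, index, top_k=top_k)))
```

The in-memory list works for a small demo. For a large dataset you would swap `build_index` and `search` for a vector database, as the procedure above notes.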

You will need an external vectorstore and a memory strategy to retrieve the context needed.