Workaround for knowledge files bigger than the context window

My use case looks like:

The chunked knowledge files in my custom GPT add up to more than the 32k-token context window.

But each query needs an evaluation of all the knowledge data, because the knowledge files contain technical data for different device models, and the queries are about selecting the correct data for a particular device model.

Q: How can all knowledge files be evaluated for each query if their combined size is larger than the context window?

It should read the first few and then the next few.

@rjywjy
What happens if the context window is full and only a part of the knowledge files has been read?

Depending on whether your question involves context that is still in the window, the GPT’s adaptive attention mechanism should, by design, help it recall the appropriate part. Beyond that, it may cause forgetting, and GPT’s recall rate is only 100% within 128k.


i’d offload the processing of the inquiry against the knowledge files to a separate thread. then, when you get the result, send it back to the main thread.

edit: i see the tag refers to custom gpt. in this case, it won’t work. perhaps use an action to offload the inquiry to remote processing.

Do you mean it would work in an AI Assistant context, but not in a custom GPT?
Could it work with a vector database attached to the custom GPT via an action?

it could. although semantic search itself does not need any APIs and you can compute it on your own with your vector DB, one issue is that you still need to vectorize the user inquiry.
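
For illustration, here is a minimal sketch of that vectorization step, assuming the OpenAI embeddings endpoint and plain cosine similarity against chunks you have already embedded with the same model (the helper names are made up):

```python
# Sketch: vectorize the user inquiry, then rank pre-embedded knowledge chunks
# by cosine similarity. Assumes chunk_vectors was built with the same model.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def top_chunks(query: str, chunk_texts: list[str], chunk_vectors: np.ndarray, k: int = 5):
    q = embed(query)
    # cosine similarity = dot product divided by the product of the norms
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(sims)[::-1][:k]
    return [(chunk_texts[i], float(sims[i])) for i in best]
```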

When directly interacting with API models, if you attempt to send more input to a model than the available context length, you will receive an API error.

If you send input just under the limit and do not specify a max_tokens parameter to reserve that space solely for forming the output, you can be left with little room for the response and receive a truncated reply.
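
A hedged sketch of that reservation when calling chat completions directly; the model name and token numbers are placeholders, not a recommendation:

```python
# Sketch: reserve output space so a near-limit prompt does not get cut off.
from openai import OpenAI

client = OpenAI()
large_prompt = "<knowledge and question assembled elsewhere>"  # placeholder

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": "Answer using only the provided knowledge."},
        {"role": "user", "content": large_prompt},
    ],
    max_tokens=1000,  # keep ~1000 tokens of the context window free for the reply
)

# finish_reason == "length" means the reply hit the token limit and was truncated
if completion.choices[0].finish_reason == "length":
    print("Truncated: trim the input or raise max_tokens.")
```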

ChatGPT, however, manages the loading of context with its own technology. Old chat information must be discarded for the session to continue at full capacity, and newly retrieved knowledge from searching the file database pushes that old chat out faster.

So, just like RAG, it parses the file, vectorizes it, and feeds it back to the GPT? Can you go into the specific details?


The documentation can go into specific details. https://platform.openai.com/docs/assistants/tools/file-search/how-it-works

Assistants’ file search returns chunks of documents, 5 for 16k context models, or 20 for 128k context models.

The Assistants AI makes a file-search tool request with a search query; the results are placed into the conversation thread, and the AI is called again with that additional data message, instantly growing the size. Assistants generally keeps the input size from overflowing (though not from overflowing your budget).
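
For orientation, a minimal setup sketch of that mechanism with the Assistants v2 SDK (method paths have shifted between openai-python versions, and the file name is a placeholder):

```python
# Sketch: attach knowledge files to an assistant through a vector store,
# so the model can issue file_search tool requests against them.
from openai import OpenAI

client = OpenAI()

vector_store = client.beta.vector_stores.create(name="device-knowledge")
client.beta.vector_stores.file_batches.upload_and_poll(
    vector_store_id=vector_store.id,
    files=[open("device_models.pdf", "rb")],  # placeholder file
)

assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="Answer device questions only from the attached knowledge.",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)
```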


Developers using chat completions to interface directly with AI models can place any amount of knowledge augmentation context messages they want, on-demand or automatically.
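
A rough sketch of that pattern, assuming the relevant chunks have already been retrieved by your own search step (the model name and message layout are just one possible choice):

```python
# Sketch: the developer decides exactly which knowledge goes into the request.
from openai import OpenAI

client = OpenAI()

def answer(query: str, retrieved_chunks: list[str]) -> str:
    knowledge = "\n\n".join(retrieved_chunks)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Use only the provided knowledge to answer."},
            {"role": "system", "content": f"Knowledge:\n{knowledge}"},
            {"role": "user", "content": query},
        ],
        max_tokens=800,  # keep room for the reply
    )
    return completion.choices[0].message.content
```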


@_j how would you decide between one setup and the other:

  1. custom GPT + an external vector database, like qdrant,
  2. AI Assistant

@rjywjy
My use case is a kind of watch recommendation engine:

  1. Four knowledge files with descriptions of watch models, sized 4–110 MB, not chunked at the moment,
  2. 6k watch images
  3. 1k pdf watch manuals

The user uploads an image of their own watch, types the watch model as text, like “Seiko Tuna”, or the model number, and asks something like “recommend me more similar watch models”.

Expected response: the AI compares the uploaded image with images in the knowledge base, searches through the textual knowledge, and responds with a kind of match/recommendation, like “the best match from the database is this or that”.


I would decide based on whether:

  • I am a subscriber to ChatGPT Plus, the web chatbot and the only place where you can create and use a “GPT”, with low-code actions to make calls to whatever external APIs allow connections, and I am willing to pay a vector database company for the usage from ChatGPT subscribers if I share it;
  • or I am a programmer who wants to place the AI in my own product, site, or application, and thus would use the API and my own code that the AI can call upon with functions.

I would reject both options, per the recommendation that ends my last reply.


Chunk it based on individual watch models. Provide the watch name as part of the metadata and search on that meta information.
Images should be kept in a separate collection. Once an image is uploaded, embed it and perform a vector search for similarity, then match it against the watch manual and knowledge embeddings.
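
A minimal sketch of that layout, assuming qdrant (named earlier in the thread) and OpenAI text embeddings; the collection name, payload fields, and sample point are illustrative only:

```python
# Sketch: one point per watch model, with the name/brand kept as payload
# metadata so retrieval can filter alongside vector similarity.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

oa = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

def embed(text: str) -> list[float]:
    return oa.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

qdrant.create_collection(
    collection_name="watch_knowledge",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),  # matches the embedding size
)

qdrant.upsert(
    collection_name="watch_knowledge",
    points=[PointStruct(
        id=1,
        vector=embed("Seiko 'Tuna': 300 m quartz diver with shrouded case ..."),
        payload={"brand": "Seiko", "model": "Tuna"},
    )],
)

# Metadata-filtered similarity search: only points whose payload matches the filter
hits = qdrant.search(
    collection_name="watch_knowledge",
    query_vector=embed("professional 300 m quartz diver"),
    query_filter=Filter(must=[FieldCondition(key="brand", match=MatchValue(value="Seiko"))]),
    limit=5,
)
```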


@aakash could you briefly explain what the need for / benefit of metadata is, alongside the embedding files?

@aakash
I have an additional question; maybe you can enlighten me on this:
By default, I have a URL and its content for the embeddings calculation. The content is much longer than the recommended chunk size of 128 tokens (the average length is between 20k and 30k tokens).
What is the way to split the content into meaningful chunks?

So my suggestions are based on your user-flow.

  1. The user uploads a watch image.
  2. The app takes the image and prepares an embedding (CLIP or GroundingDINO).
  3. Compare it with the existing images in the database.
  4. The top-N similarity matches can be shown to the user.
  5. With each embedding you can save metadata such as the watch name and brand.
  6. The metadata can then be used to retrieve the relevant watch manuals and knowledge-base entries (see the sketch after this list).
  7. This will lower your costs, since your context window will contain only relevant knowledge.
  8. The app then responds to user queries using the GPT API; the smaller 3.5 models can be used for this case.

Typically, having metadata helps you save costs by allowing the use of smaller models, and the user gets a rich multimodal (image + text) experience.
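
One way steps 2–6 could look in code, assuming CLIP image embeddings from Hugging Face transformers and a separate qdrant image collection whose payloads carry the watch name and brand (all names are illustrative):

```python
# Sketch: embed the uploaded watch photo, search the image collection,
# then use the payload metadata to pull the matching manuals/text chunks.
import torch
from PIL import Image
from qdrant_client import QdrantClient
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
qdrant = QdrantClient(url="http://localhost:6333")

def embed_image(path: str) -> list[float]:
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    features = features / features.norm(dim=-1, keepdim=True)  # L2-normalize for cosine search
    return features[0].tolist()

hits = qdrant.search(
    collection_name="watch_images",  # images live in their own collection
    query_vector=embed_image("user_upload.jpg"),
    limit=5,  # top-N similar watches
)
# e.g. [{"brand": "Seiko", "model": "Tuna"}, ...] -> used to fetch manuals and text chunks
candidates = [hit.payload for hit in hits]
```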


I use a kind of custom logic to get around the context length, typically breaking the knowledge base into sections (if they are available in the document). There are other chunking strategies for larger docs; you need to experiment to find the best fit for your doc/use case.
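
A small sketch of that section-based splitting, assuming markdown-style headings and tiktoken for the token budget; the regex and limit are starting points to adapt, not a fixed recipe:

```python
# Sketch: split on section headings first, then hard-split anything still
# over the token budget so every chunk fits comfortably in the prompt.
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_by_sections(text: str, max_tokens: int = 512) -> list[str]:
    sections = re.split(r"\n(?=#+ )", text)  # adjust the heading pattern to your docs
    chunks = []
    for section in sections:
        tokens = enc.encode(section)
        for start in range(0, len(tokens), max_tokens):
            chunks.append(enc.decode(tokens[start:start + max_tokens]))
    return chunks
```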

My rule of thumb is to use the smallest available model wherever possible. I will use a larger model (4o) only when I have exhausted all other strategies.

EDIT: Missed adding the links to relevant articles for chunking strategies.


Hi there, to me it looks like an issue in your application design: you need to store the data separately for each of the device types and then query the knowledge base by device type before stuffing that knowledge into the prompt.

Ideally you would have a database with multiple columns like instruction type, device type, knowledge text, related information, source document, etc.

When a specific instruction or knowledge file is needed for a known device, you query your database using the device type and the instruction type, then you select what is necessary and stuff that into your prompt.
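
As a rough sketch of that lookup-then-stuff flow, assuming a SQLite table with the columns described above (table, column, and model names are purely illustrative):

```python
# Sketch: select only the rows for the known device, then put just those
# rows into the prompt instead of the whole knowledge base.
import sqlite3
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("knowledge.db")  # placeholder database

def answer_for_device(device_type: str, instruction_type: str, question: str) -> str:
    rows = db.execute(
        "SELECT knowledge_text FROM knowledge WHERE device_type = ? AND instruction_type = ?",
        (device_type, instruction_type),
    ).fetchall()
    knowledge = "\n\n".join(row[0] for row in rows)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": f"Knowledge for {device_type}:\n{knowledge}"},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content
```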

If your prompt is still bigger than 32k tokens, you definitely need to come back to the drawing board and rethink your application workflows.

How is it currently working, and what are you trying to do?