Assistants with knowledge base: How to determine atomic piece of information during chunking for more accurate retrieval?

fabrice.hong · November 10, 2023, 3:09pm

Hello everyone,

I am encountering challenges in obtaining accurate responses from the assistant when using a set of files as a knowledge base. I believe that gaining some control over how the API chunks information could significantly enhance its efficiency.

Challenges:

Based on the documentation, I understand that the API can use entire files as retrieval chunks if they are small enough. However, I am limited to a maximum of 20 files. This constraint pushes me towards submitting one large, comprehensive file instead of multiple smaller ones, leaving me without control over how OpenAI segments the information for retrieval.
The retrieval tool segments uploaded files into chunks that frequently lose their original context. For example, in a chapter describing a specific coding principle, if some of the relevant details sit in a part of the text that shares no terms with the user’s query, that part is not retrieved and crucial logic details are omitted, leaving the retrieved information incomplete.
The retrieval process should be user-friendly and not require users to structure their queries with specific keywords. In its current form, if a user’s query lacks certain keywords, contextually relevant chunks are often missed by the retrieval tool.

Questions:

How can I influence the way OpenAI chunks information to ensure each chunk represents an atomic piece of information? This could be done by inserting markers in the files, or by specifying markers inherent to the file format (CSV, Markdown, YAML).
Is there a technique or a possibility to include specific instructions in the assistant’s system settings that would guide the retrieval tool towards more accurate and contextually relevant information extraction?

I would be open to processing my knowledge base with AI to conform to these constraints, aiming to create a list of standalone facts and relationships between concepts that are more digestible for the retrieval process.
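
A minimal sketch of such preprocessing, assuming the knowledge base is written in Markdown: split the text at headings and prefix every chunk with its full heading path, so each piece still carries its context when read in isolation. The function name `split_markdown_by_heading` and the `[A > B]` prefix format are hypothetical choices for illustration; nothing here calls the Assistants API, and whether the retrieval tool would respect such boundaries is not known.

```python
import re
from typing import List

# Matches ATX-style Markdown headings such as "## Title".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_heading(text: str) -> List[str]:
    """Split Markdown into one chunk per heading section.

    Each chunk is prefixed with its heading path (hypothetical format),
    so it remains understandable without the surrounding document.
    """
    chunks: List[str] = []
    path: List[str] = []  # current heading titles, outermost first
    body: List[str] = []  # body lines collected for the current section

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            prefix = " > ".join(path)
            chunks.append(f"[{prefix}]\n{content}" if prefix else content)
        body.clear()

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then descend.
            del path[level - 1:]
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = (
        "# Coding principles\n"
        "Intro text.\n"
        "## Immutability\n"
        "Prefer immutable data.\n"
    )
    for chunk in split_markdown_by_heading(sample):
        print(chunk)
        print("---")
```

This sketch treats every line starting with `#` as a heading, so `#` comments inside fenced code blocks would be split incorrectly; a real knowledge base would need to handle that case.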

Any advice, insights, or experiences shared would be greatly appreciated!

Thank you!
