Assistants with a knowledge base: How to define an atomic piece of information during chunking for more accurate retrieval?

Hello everyone,

I am having trouble getting accurate responses from the assistant when using a set of files as a knowledge base. I believe that gaining some control over how the API chunks information could significantly improve its effectiveness.

Challenges:

  1. Based on the documentation, I understand that the API can use entire files as retrieval chunks if they are small enough. However, I am limited to a maximum of 20 files. This constraint pushes me towards submitting one large, comprehensive file instead of multiple smaller ones, which leaves me without control over how OpenAI segments the information for retrieval.
  2. The retrieval tool segments uploaded files into chunks that frequently lose their original context. For example, in a chapter describing a specific coding principle, only the part whose wording matches the user's query tends to be retrieved. If other parts of the same chapter do not share specific terms with the query, the crucial logic details they contain may be omitted, leaving the retrieved information incomplete.
  3. The retrieval process should be user-friendly and not require users to structure their queries around specific keywords. In its current form, if a user's query lacks certain keywords, contextually relevant chunks are often missed by the retrieval tool.

Questions:

  1. How can I influence the way OpenAI chunks information to ensure each chunk represents an atomic piece of information? For example, this could be done by inserting markers in the files, or by specifying markers inherent to the file format (CSV, Markdown, YAML).
  2. Is there a technique or a possibility to include specific instructions in the assistant’s system settings that would guide the retrieval tool towards more accurate and contextually relevant information extraction?

I would be open to processing my knowledge base with AI to conform to these constraints, aiming to create a list of standalone facts and relationships between concepts that are more digestible for the retrieval process.
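To make the idea concrete, here is a minimal sketch of the kind of preprocessing I have in mind. It assumes the knowledge base is a Markdown file with `#`-style headings, splits it into one chunk per heading, and prefixes each chunk with its full heading path so the chunk keeps its context when read alone. The function name `split_markdown_by_heading` and the `[A > B]` prefix format are my own illustration, not something the Assistants API provides or recognizes.

```python
import re
from typing import List

# Matches Markdown ATX headings such as "## Naming rules".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_heading(text: str) -> List[str]:
    """Split Markdown into one chunk per heading.

    Each chunk is prefixed with its heading path (for example
    "[Guide > Naming]") so it stays understandable on its own.
    Limitation: lines starting with "#" inside fenced code blocks
    are also treated as headings in this simple sketch.
    """
    chunks: List[str] = []
    path: List[str] = []  # current heading trail, outermost first
    body: List[str] = []  # lines collected under the current heading

    def flush() -> None:
        # Emit the collected body as a chunk, if it has any content.
        content = "\n".join(body).strip()
        if content:
            prefix = " > ".join(path)
            chunks.append(f"[{prefix}]\n{content}" if prefix else content)
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or deeper level
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# Guide\nIntro.\n## Naming\nUse snake_case.\n## Errors\nRaise early.\n"
    for chunk in split_markdown_by_heading(doc):
        print(chunk)
        print("---")
```

Whether the retrieval tool would keep such chunks intact is exactly what I do not know; the sketch only shows how the source could be made self-contained before upload.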

Any advice, insights, or experiences shared would be greatly appreciated!

Thank you!
