RAG knowledge base solution - reduce number of files

KevinWhiteSX · July 21, 2025, 8:47am

Hello all:

I’m developing a RAG-type solution for my business. We will use the Assistants API and file search against a vector store. Our knowledge base is divided into atomic topics, each containing significant knowledge. By experimentation, we have found that dividing a single topic into more granular atomic chunks results in the best performance. Each granular atomic chunk is plain English with a bit of markdown formatting. Crucially, the filename of each granular atomic chunk conveys necessary metadata linking back to our internal systems. When queried with a plain-English question, the Assistants API includes references to these files in its answers, which is very important to us.

The above provides a brief overview of our prototyping process. The problem is that a single topic, when granularly divided as described above, can result in many tens of files. For example, one topic results in 128 granular atomic file chunks, and we have hundreds of topics.

I’m looking to understand if there is a better way. For example, can we combine our granular atomic file chunks into a single file for upload purposes but still return the references associated with each granular chunk in the generated assistant responses?

Any ideas will be appreciated.

Regards
Kevin

jlvanhulst · July 21, 2025, 4:01pm

Since the Assistants API will be retired next year, you might want to change that as well as part of your ongoing efforts.
If you give me some more detailed information/examples we can help better. You describe 100’s of topics. How long is a topic? And if the result is so many files, what happens afterwards?

KevinWhiteSX · July 22, 2025, 6:39am

Thank you for replying and informing me that assistants will be deprecated next year. According to their documentation, OpenAI will replace them with a functionally equivalent Responses API that will be made available early next year. It looks like they are simply evolving and consolidating their APIs which makes sense.

Regarding our content, our knowledge base is a collection of topics (JSON documents). Let’s say an individual topic consists of many paragraphs. Before we upload to OpenAI, we transform the topic and split it into separate paragraphs. Each paragraph is stored in an individual file. For example, topic1.json becomes:

topic1-paragraph1-{id}.txt
topic1-paragraph2-{id}.txt
…
topic1-paragraph128-{id}.txt

During splitting, we remove all JSON, and each resultant txt file contains only plain English and a little markdown. We split like this because the filenames convey important metadata, particularly the {id} value. So, when we use the assistant for chat, the assistant includes references to the individual files, which allows us to verify the response. However, as said before, one topic (topic1.json) becomes 128 individual txt files. It works pretty well, but there are many files, and we have 100s of topics.
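
To make the filename convention concrete, here is a minimal sketch of how a cited filename could be mapped back to its metadata. It assumes names of the form topic<N>-paragraph<M>-<id>.txt as in the example above; the shape of the {id} value and the helper names `ChunkRef` and `parse_chunk_filename` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed filename shape: topic<N>-paragraph<M>-<id>.txt
# The id is treated as an opaque string; its real format is not known here.
CHUNK_NAME = re.compile(
    r"^(?P<topic>topic\d+)-paragraph(?P<paragraph>\d+)-(?P<chunk_id>.+)\.txt$"
)


class ChunkRef(NamedTuple):
    topic: str
    paragraph: int
    chunk_id: str


def parse_chunk_filename(filename: str) -> Optional[ChunkRef]:
    """Return the metadata encoded in a chunk filename, or None if it does not match."""
    match = CHUNK_NAME.match(filename)
    if match is None:
        return None
    return ChunkRef(match["topic"], int(match["paragraph"]), match["chunk_id"])
```

Each filename cited in an assistant response could be passed through `parse_chunk_filename` to recover the id used by the internal systems.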

Anyway, I hope this is a little clearer.

jlvanhulst · July 22, 2025, 3:54pm

Have you considered having the topics for one product in a single markdown file, with the id in a header like ## ID: <>
Not sure about the total size(s), but I don’t see why that wouldn’t come up with consistent results, with WAY fewer files.
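
A minimal sketch of that idea, assuming each chunk is already available as an (id, text) pair: it writes one markdown document per topic with a `## ID: <id>` header per chunk, and recovers the ids again from any text that contains those headers. The function names are hypothetical and nothing here calls the OpenAI API.

```python
import re
from typing import Dict, List

# Matches a header line such as "## ID: abc-123"
ID_HEADER = re.compile(r"^## ID: (\S+)\s*$", re.MULTILINE)


def combine_chunks(chunks: Dict[str, str]) -> str:
    """Join chunk texts into one markdown document, one '## ID:' section per chunk."""
    sections = [f"## ID: {chunk_id}\n\n{text.strip()}\n" for chunk_id, text in chunks.items()]
    return "\n".join(sections)


def ids_in_passage(passage: str) -> List[str]:
    """List the chunk ids whose headers appear in a passage of text."""
    return ID_HEADER.findall(passage)


def split_sections(document: str) -> Dict[str, str]:
    """Invert combine_chunks: map each id back to its section body."""
    parts = ID_HEADER.split(document)
    # parts is [preamble, id1, body1, id2, body2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)}
```

One caveat: file search chunks a document by size rather than by header, so a retrieved passage may start mid-section and not include its `## ID:` line. Keeping sections short, or repeating the id inside the body text, might help, but that would need testing.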

KevinWhiteSX · July 23, 2025, 11:32am

Thanks once again for your reply.

We’ve decided to switch to using AWS knowledge bases and agents. AWS supports providing a CSV file along with a corresponding metadata file. Each row of CSV is chunked individually, and metadata from the row’s columns can be associated with that chunk. That’s perfect for our needs and solves the file explosion problem. I would provide a link to the AWS documentation, but OpenAI won’t let me post links for some reason!
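
For anyone wanting to try the same route, here is a rough sketch of producing the CSV and its metadata file. It assumes the service is Amazon Bedrock Knowledge Bases and that the sidecar is named after the CSV with a `.metadata.json` suffix. The JSON field names are written from memory of the AWS documentation and are an assumption to check against the current docs; the column names `chunk_id`, `topic`, and `text` are hypothetical.

```python
import csv
import io
import json
from typing import Dict, List


def build_csv(rows: List[Dict[str, str]], columns: List[str]) -> str:
    """Serialize one row per chunk, with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def build_metadata(content_column: str, metadata_columns: List[str]) -> str:
    """Build the sidecar JSON naming the content column and the metadata columns.

    Assumption: this schema mirrors the record-based CSV metadata format
    described in the AWS docs; verify the field names before relying on it.
    """
    sidecar = {
        "documentStructureConfiguration": {
            "type": "RECORD_BASED_STRUCTURE_METADATA",
            "recordBasedStructureMetadata": {
                "contentFields": [{"fieldName": content_column}],
                "metadataFieldsSpecification": {
                    "fieldsToInclude": [{"fieldName": c} for c in metadata_columns]
                },
            },
        }
    }
    return json.dumps(sidecar, indent=2)


rows = [{"chunk_id": "abc", "topic": "topic1", "text": "Hello"}]
csv_text = build_csv(rows, ["chunk_id", "topic", "text"])
metadata_text = build_metadata("text", ["chunk_id", "topic"])
```

The two strings would then be written to, say, topic1.csv and topic1.csv.metadata.json and uploaded to the data source bucket together.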
