Hello all:
I’m developing a RAG-type solution for my business. We will use the Assistants API and file search against a vector store. Our knowledge base is divided into atomic topics, each containing significant knowledge. By experimentation, we have found that dividing a single topic into more granular atomic chunks results in the best performance. Each granular atomic chunk is plain English with a bit of markdown formatting. Crucially, the filename of each granular atomic chunk conveys necessary metadata linking back to our internal systems. When queried with a plain-English question, the Assistants API includes references to these files in its response, which is very important to us.
The above provides a brief overview of our prototyping process. The problem is that a single topic, when divided as described above, can result in many tens of files. For example, one topic results in 128 granular atomic file chunks, and we have hundreds of topics.
I’m looking to understand if there is a better way. For example, can we combine our granular atomic file chunks into a single file for upload purposes but still return the references associated with each granular chunk in the generated assistant responses?
Any ideas will be appreciated.
Regards
Kevin
Since the Assistants API will be retired next year, you might want to change that as well as part of your ongoing efforts.
If you give us some more detailed information or examples, we can help better. You mention hundreds of topics. How long is a topic? And if the result is so many files, what happens afterwards?
Thank you for replying and informing me that the Assistants API will be deprecated next year. According to their documentation, OpenAI will replace it with a functionally equivalent Responses API that will be made available early next year. It looks like they are simply evolving and consolidating their APIs, which makes sense.
Regarding our content, our knowledge base is a collection of topics (JSON documents). Let’s say an individual topic consists of many paragraphs. Before we upload to OpenAI, we transform the topic and split it into separate paragraphs. Each paragraph is stored in an individual file. For example, topic1.json becomes:
- topic1-paragraph1-{id}.txt
- topic1-paragraph2-{id}.txt
- …
- topic1-paragraph128-{id}.txt
During splitting, we remove all JSON, and each resultant txt file contains only plain English and a little markdown. We split like this because the filenames convey important metadata, particularly the {id} value. So, when we use the assistant for chat, the assistant includes references to the individual files, which allows us to verify the response. However, as said before, one topic (topic1.json) becomes 128 individual txt files. It works pretty well, but there are many files, and we have hundreds of topics.
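For anyone following along, the splitting step described above might look roughly like the sketch below. The JSON layout (a `paragraphs` list whose entries carry `id` and `text`) and the function name `split_topic` are assumptions for illustration only; the real topic schema isn't shown in this thread.

```python
import json


def split_topic(topic_name, topic):
    """Return {filename: text}, one entry per paragraph of the topic.

    Assumption: each paragraph is a dict like {"id": ..., "text": ...}.
    """
    files = {}
    for number, paragraph in enumerate(topic["paragraphs"], start=1):
        # The filename carries the metadata, e.g. topic1-paragraph1-a17.txt
        filename = f"{topic_name}-paragraph{number}-{paragraph['id']}.txt"
        files[filename] = paragraph["text"].strip() + "\n"
    return files


# Hypothetical two-paragraph topic standing in for topic1.json.
topic1 = json.loads(
    '{"paragraphs": ['
    '{"id": "a17", "text": "First paragraph."},'
    '{"id": "b42", "text": "Second paragraph."}]}'
)

chunks = split_topic("topic1", topic1)
for name in chunks:
    print(name)
```

Each value in `chunks` would then be written to disk and uploaded as its own file, which is where the file count grows quickly.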
Anyway, I hope this is a little clearer.
Have you considered having the topics for one product in a single markdown file, with the id in a header like ## ID: <>?
Not sure about the total size(s), but I don’t see why that wouldn’t be able to produce consistent results, with WAY fewer files.
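A minimal sketch of that single-file idea, assuming each paragraph is available as an (id, text) pair. The helper names `build_topic_markdown` and `ids_in` are made up for this example, and the second one only shows how ids could be pulled back out of whatever text a retrieval step returns.

```python
import re


def build_topic_markdown(title, paragraphs):
    """Join (id, text) pairs into one markdown document with '## ID:' headers."""
    parts = [f"# {title}", ""]
    for paragraph_id, text in paragraphs:
        parts += [f"## ID: {paragraph_id}", "", text.strip(), ""]
    return "\n".join(parts)


# Matches a header line of the form "## ID: <value>" and captures the value.
ID_HEADER = re.compile(r"^## ID: (\S+)$", re.MULTILINE)


def ids_in(text):
    """Return every id found in '## ID:' headers inside the given text."""
    return ID_HEADER.findall(text)


doc = build_topic_markdown(
    "topic1",
    [("a17", "First paragraph."), ("b42", "Second paragraph.")],
)
print(doc)
print(ids_in(doc))
```

One caveat with this approach: if the vector store chunks the file by size rather than by header, a retrieved chunk may not contain the header that belongs to its text, so the id would need to be checked against the returned passage rather than assumed.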
Thanks once again for your reply.
We’ve decided to switch to using AWS knowledge bases and agents. AWS supports providing a CSV file along with a corresponding metadata file. Each row of the CSV is chunked individually, and metadata from the row’s columns can be associated with that chunk. That’s perfect for our needs and solves the file explosion problem. I would provide a link to the AWS documentation, but OpenAI won’t let me post links for some reason!
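As a rough illustration of the one-row-per-chunk layout, the sketch below builds a CSV with `id`, `topic`, and `text` columns plus a sidecar metadata document. The column names and helper names are hypothetical, and the metadata keys are written from memory of the AWS documentation, so treat them as an assumption and check the current docs before relying on them.

```python
import csv
import io
import json

# Hypothetical rows: one row per paragraph, with the id kept as a column.
rows = [
    {"id": "a17", "topic": "topic1", "text": "First paragraph."},
    {"id": "b42", "topic": "topic1", "text": "Second paragraph."},
]


def build_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["id", "topic", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_metadata(content_field, metadata_fields):
    """Describe which column is content and which columns are metadata.

    Assumption: key names follow the record-based metadata format as
    recalled from the AWS docs; verify them before use.
    """
    return {
        "documentStructureConfiguration": {
            "type": "RECORD_BASED_STRUCTURE_METADATA",
            "recordBasedStructureMetadata": {
                "contentFields": [{"fieldName": content_field}],
                "metadataFieldsSpecification": {
                    "fieldsToInclude": [
                        {"fieldName": name} for name in metadata_fields
                    ]
                },
            },
        }
    }


csv_text = build_csv(rows)
metadata_text = json.dumps(build_metadata("text", ["id", "topic"]), indent=2)
print(csv_text)
print(metadata_text)
```

With this shape, a topic with 128 paragraphs becomes one CSV of 128 rows plus one metadata file, instead of 128 separate uploads.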