What is the best file format to use as a knowledge-base?

When using Assistants with file storage (a vector store), which file format can the model analyze most accurately and quickly? I’m currently using a .json file with 40661 lines, but retrieval is too slow.
Note that the vector store holds only one file for testing; production may have 10+ files with more data.


I know this isn’t the answer you’re expecting, but in my personal experience the best way to store data for a RAG engine is a combination of a relational database, a vector database, and a robust REST API that lets the AI query the data.
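To illustrate why the two stores complement each other, here is a minimal sketch, not a production design. The table, the sample rows, and the helper names are all made up for illustration, and the bag-of-words `toy_vector` is only a stand-in for a real embedding model. The REST layer is left out.

```python
import math
import sqlite3
from collections import Counter

# Hypothetical sample data: structured fields plus a free-text note.
ROWS = [
    (1, "Ann Smith", "Handles billing disputes and refunds"),
    (2, "Bob Jones", "Maintains the deployment pipeline"),
    (3, "Cara Smith", "Writes onboarding documentation"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, note TEXT)")
db.executemany("INSERT INTO people VALUES (?, ?, ?)", ROWS)


def by_surname(surname):
    """Exact, structured lookup: the relational side."""
    cur = db.execute(
        "SELECT name FROM people WHERE name LIKE ? ORDER BY id",
        (f"% {surname}",),
    )
    return [row[0] for row in cur]


def toy_vector(text):
    """Stand-in for an embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query):
    """Fuzzy lookup over the free text: the vector side."""
    q = toy_vector(query)
    rows = db.execute("SELECT name, note FROM people").fetchall()
    return max(rows, key=lambda r: cosine(q, toy_vector(r[1])))[0]


print(by_surname("Smith"))
print(most_similar("who handles refunds"))
```

The exact question ("everyone named Smith") goes to SQL, and the fuzzy question goes to the similarity search. In a real setup the AI would reach both through API calls rather than direct function calls.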

Storing everything in a file works at the proof-of-concept stage, but once you go to production you end up with this setup anyway. So you may want to consider it from the very beginning.

As for file formats, Markdown worked well for me, but it depends on the data you’re storing.


Are you talking about the time between when a document is uploaded and when the vector store becomes available, or about the time for language generation?

The data format doesn’t slow down an embeddings-based vector store. Text is extracted from the file, whatever its type, into a form the AI can read, and it becomes available quickly: the only additional embeddings call at runtime is for the search query the AI emits.

The problem with structured data such as JSON is that when it is chunked, a slice may lose the context of where it sat in the hierarchy. The split might not fall at a boundary that matches the data objects or the level at which individual items need to be understood. Also, semantic search is of little use on data where everything is essentially similar and many elements share one chunk: a query like “answer about everyone named Smith” has a low chance of working.
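One way to work around the lost-hierarchy problem is to flatten each record into its own small text block before uploading, so that every line carries its full key path. This is just a sketch under assumptions: the sample JSON, the `Employee` label, and both function names are invented for illustration, and your data may need a different record boundary.

```python
import json


def flatten(node, path=""):
    """Yield (path, value) pairs for every leaf in a nested JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, f"{path}[{index}]")
    else:
        yield path, node


def records_to_chunks(records, label):
    """Turn each top-level record into one self-contained text chunk."""
    chunks = []
    for i, record in enumerate(records):
        lines = [f"{label} {i + 1}"]
        lines += [f"{p}: {v}" for p, v in flatten(record)]
        chunks.append("\n".join(lines))
    return chunks


# Hypothetical input; real data would come from your .json file.
raw = """[
  {"name": "Ann Smith", "contact": {"email": "ann@example.com"}, "tags": ["admin", "billing"]},
  {"name": "Bob Jones", "contact": {"email": "bob@example.com"}, "tags": []}
]"""

chunks = records_to_chunks(json.loads(raw), label="Employee")
print(chunks[0])
```

Each chunk can then be written out as plain text or Markdown, one record per block, so a split between blocks never separates a value from the keys that explain it.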

Plain text is the most inspectable for you: it needs little additional processing before reaching the AI, and it has no extra formatting or container to confuse the embeddings or make chunks look more alike.
