While using Assistants with storage (a vector store), what is the best file format for the model to analyze correctly and faster than others? I’m using the .json format, which has 40,661 lines, but retrieving data from it is too slow.
Note that the vector store currently holds only one file for testing; production may have 10+ files with more data.
I know that’s not the answer you expect, but from my personal experience, the best way to store data for a RAG engine is a combination of a relational database, plus a vector database, plus a robust REST API that lets the AI run queries over the data.
Storing everything inside a file works at the proof-of-concept stage, but when you go to production you end up exactly where you are now. So maybe you should consider this option from the very beginning.
As for the file format, Markdown worked well for me. But it depends on the data you’re storing.
Are you talking about the time between when a document is uploaded and when the vector store becomes available, or about the time for language generation?
The data format doesn’t slow down an embeddings-based vector store. The data is extracted from any file into text readable by the AI and becomes available to it quite quickly; the only additional embeddings call at runtime is on the search query emitted by the AI.
The problem with structured data such as JSON is that when that knowledge is chunked, it may lose the context of where it sits in the hierarchy. The file might not be split at boundaries that match the contained data objects, or at the level needed to understand individual items. Also, semantic search has little value on data where everything is essentially similar and many elements land in one chunk: a query like “answer about everyone named Smith” has a low chance of working.
Plain text is the most inspectable for you: it needs little additional processing before reaching the AI, and it has no extra formatting or container markup to confuse the embeddings or inflate the similarity between chunks.
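To make the chunking point concrete, here is a minimal sketch of preprocessing a JSON array into plain text before upload, so that each record becomes a self-contained block and chunk boundaries fall between records instead of mid-object. The field names and data are hypothetical, just for illustration:

```python
import json

# Hypothetical input: a JSON array of records, as you might load from your .json file.
records = json.loads("""[
    {"name": "John Smith", "role": "engineer", "city": "Boston"},
    {"name": "Jane Smith", "role": "designer", "city": "Austin"}
]""")

def record_to_text(rec: dict) -> str:
    """Render one record as a self-contained plain-text block.

    Each block repeats every field with its label, so even if the chunker
    splits between records, no block depends on surrounding hierarchy.
    """
    return "\n".join(f"{key}: {value}" for key, value in rec.items())

# A blank line between records gives the chunker a natural split point.
plain_text = "\n\n".join(record_to_text(r) for r in records)
print(plain_text)
```

The idea is simply: one entity per block, full context repeated in each block, and obvious split points, which is what JSON as-is doesn’t give you.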
thanks for your response @sergeliatko
could you explain this more?
Do you mean that my vector store or assistant files should be .sql or .db files,
or do you mean that I upload many files and the assistant treats them as a database?
combination of a relational database plus a vector database plus a robust rest API
A basic approach would be (layers accessed from outside to inside):
- Load balancer/reverse proxy as entry point
- REST API that processes the request, where you define your endpoints/operations, preprocess the request, decide which backend features to use, and transform the results returned to the client
- Relational database where your structured data is stored, ideally accessible as a db service via internal API
- Vector database where your embeddings are stored (traceable to the entities in the relational DB and back), also set up as a service accessible by #2 and #3
- AI service that is used by #2 when needed
Example flow:
The user asks the assistant to check the sales stats for a product but messes up the product name.
The assistant hits the endpoint /stats?product=a. #1 sends the request to #2, which sends the SQL to #3 and gets a “no such product” response. Then #2 either directly searches #4 for products similar to “a”, or checks with #5 what to do next (#5 reports the error to the product-department AI and returns a function-call response telling #2 to search products in #4). If only one product matches, #2 gets its sales stats and returns them to the assistant; otherwise it sends the results to #5 to be instructed to send a “precision request” back to the assistant with the list of possible product names…
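The control flow in that example can be sketched in a few lines. This is not a real implementation: the relational DB (#3), vector DB (#4), and product data are stubbed as in-memory fakes, and the similarity search is a naive substring match standing in for an embeddings query. All names are illustrative:

```python
# Hypothetical product table standing in for the relational DB (#3).
PRODUCTS = {"alpha widget": 120, "beta widget": 95}  # product -> units sold

def sql_lookup(name: str):
    """Stub for #3: exact-match query against the relational database."""
    return PRODUCTS.get(name)

def vector_search(query: str):
    """Stub for #4: similarity search (here, a naive substring match)."""
    return [p for p in PRODUCTS if query in p]

def stats_endpoint(product: str) -> dict:
    """Sketch of the /stats?product=... handler inside layer #2."""
    sales = sql_lookup(product)
    if sales is not None:                  # exact hit in the relational DB
        return {"product": product, "sales": sales}
    matches = vector_search(product)       # fall back to similarity search
    if len(matches) == 1:                  # unambiguous: answer directly
        return {"product": matches[0], "sales": sql_lookup(matches[0])}
    # Ambiguous or empty: return a "precision request" so the assistant
    # can ask the user which product they meant.
    return {"error": "ambiguous product", "candidates": matches}

print(stats_endpoint("a"))  # a garbled name falls through to the candidate list
```

The key design point is that the fallback logic lives in #2, not in the assistant: the assistant only sees one endpoint and either gets stats or a list of candidates to clarify with the user.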
Sounds complex, but in reality it’s not as complex as you might imagine. Just pay attention to the API definitions and edge cases.
The benefits: pretty much no limit to what is doable with this approach.
P.S. The “assistant” is the front-door AI that interacts with the user, and it only has the API definitions of #2. That’s all it needs to know about. It’s more like an “adapter” between a human and your app, available via the API. The true magic is in the code of #2 and the features of #5.
See an example of a custom GPT connected to an app using this approach (screenshot from my phone).
And here are the tricky parts of that:
The unconfirmed bookings actually live in a third-party database of form submissions and are called “reservations” through the API. So the assistant went to the third-party database, checked the unconfirmed bookings, then went into my app’s database to see whether the guest had paid for tickets, found two orders, then went back to the third-party database and updated the first and second reservations so they no longer count as “pending”. The sales stats come from the orders table in the app. Notice that I ask for ticket sales, not for orders, and I never specify what status should be set in the third-party database; the assistant knows all that from the API definitions.
Have you tried a plain-text doc? In my experience that brought better results than JSON. Not sure if that’d be possible in your scenario, though.