How to optimize data for Knowledge Retrieval with the Assistants API

Hello Everyone,

I’ve been building an application with the Assistants API knowledge retrieval tool. Currently I’m “forcing” the API to look at my data by specifying the file ID in the prompt, which works fine.

I have multiple smaller files, currently stored in JSON format, each 1.5 MB or less. The data is both text and numbers. Does anyone know whether JSON is the optimal format, or whether I should store the data as CSV or another file type instead?

If I stored it as a CSV, each row would have 8 columns of related data.

Thank you!


Could you give more details on how you “force” it?

I specify which file ID I want it to look at to retrieve my answer, for example:

Use name_of_file, file id: file-xjhdgfdshgfdjshgf, to tell me what …

I have also specified the files it should use in the assistant instructions (by name, not ID), but several times it told me it couldn’t access the files. Specifying the file ID in the prompt each time solved this.
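
For anyone who wants to automate this, here is a minimal sketch of building such a prompt in Python. The function name `build_prompt`, the exact wording of the template, and the file ID `file-abc123` are all made up for illustration; it only produces the prompt string and does not call the API.

```python
def build_prompt(file_name: str, file_id: str, question: str) -> str:
    """Prefix a question with the file name and file ID the assistant should use.

    Hypothetical helper: the template wording is an assumption, not an
    official prompt format.
    """
    return f"Use {file_name}, file id: {file_id}, to answer this question: {question}"


# "file-abc123" is a placeholder, not a real file ID.
example_prompt = build_prompt("products.json", "file-abc123", "Which product is cheapest?")
print(example_prompt)
```

Keeping the template in one function means every message references the file in the same way.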

It really depends on how you plan to use this data and what “questions” you want answered. If it’s structured data and you ask structured questions, you should use Code Interpreter for the retrieval. In that case both .json and .csv should work, but I would recommend .csv as it has less boilerplate.
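
If you go the .csv route, the conversion can be done with the Python standard library. This is only a sketch and assumes each JSON file holds a flat array of objects (no nesting); the function name `json_records_to_csv` and the sample data are made up for illustration.

```python
import csv
import io
import json


def json_records_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Collect column names in first-seen order so that records with
    # missing keys still line up under the right header.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


# Made-up sample data, not from the original files.
sample_json = '[{"name": "Widget", "price": 9.5}, {"name": "Gadget", "price": 12, "stock": 3}]'
csv_text = json_records_to_csv(sample_json)
print(csv_text)
```

The keys are written once in the header row instead of being repeated in every record, which is where the reduction in boilerplate comes from.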

The knowledge retrieval tool is great if your questions require some level of “abstraction”, e.g. the answer is within a text, but not the text itself. In that case a .csv might work, but in my experience a .txt file with markdown formatting works best. Formatting matters, which is why I wouldn’t recommend .json: too many distractions.
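
As a rough illustration of the .txt-with-markdown idea, the sketch below turns each JSON record into a short markdown section. It assumes a flat array of objects with one field usable as a heading; the function name `json_records_to_markdown`, the `title_key` parameter, and the sample data are hypothetical, and the layout is one possible choice rather than a recommended standard.

```python
import json


def json_records_to_markdown(json_text: str, title_key: str) -> str:
    """Render a JSON array of flat objects as markdown-formatted plain text."""
    records = json.loads(json_text)
    sections = []
    for record in records:
        # One heading per record, then one bullet per remaining field.
        lines = [f"## {record[title_key]}"]
        for key, value in record.items():
            if key != title_key:
                lines.append(f"- {key}: {value}")
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"


# Made-up sample data, not from the original files.
sample_records = '[{"name": "Widget", "price": 9.5}, {"name": "Gadget", "price": 12, "stock": 3}]'
markdown_text = json_records_to_markdown(sample_records, title_key="name")
print(markdown_text)
```

The result can be saved with a .txt extension and uploaded in place of the original .json file.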

I wrote a detailed article on the knowledge retrieval tool and how to utilize it if you want to find out more. A short summary can also be found in the forum over here.