What's the best file format for recommendations using the Assistants API?

I’ve generated multiple text files containing information about YouTube channels. Specifically, I’ve created 15 files covering nearly 20,000 YouTube channels, each file containing details like titles, descriptions, subscriber counts, view counts, and other information.

However, when I tried to use the Assistants API for retrieval, the process took close to 5 minutes and occasionally timed out. The responses were also sometimes unacceptable. I’m wondering whether the TXT file format might not be suitable for this kind of retrieval.

Here’s an example of how I structured the data:

- title: ~~~
- description: ~~~~
- Youtube URL: ~~~~
- subscriber count: 220000
- tags: ......
....

- title: ~~~
- description: ~~~~
- Youtube URL: ~~~~
- subscriber count: 220000
- tags: ......
....

- title: ~~~
- description: ~~~~
- Youtube URL: ~~~~
- subscriber count: 220000
- tags: ......
....

This structure makes the information easy to retrieve, but given the performance issues and occasionally unsatisfactory responses, I’m exploring alternative formats or approaches to better handle this volume of data with the Assistants API.

What would be better? I’m using Korean, not English, so all of the documents are in Korean.


I had the AI itself run timing simulations on various file formats and was shocked that the fastest solution for me was a PDF with renderable text.

Perhaps consider a table of contents to map out your document. You could also put it into Markdown format to make it easier to parse. In the end it’s all about helping it help you.

For me, JSON has always been the best-performing format. It gives the model some degree of consistency.


Oh, table formatted data in Markdown can be a good candidate. I’ll try it.
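Probably something like this to start (just a rough sketch; the channels list and output file name are placeholders for my real data, and real descriptions would need any "|" characters escaped):

# Rough sketch: dump channel records into one Markdown table per file.
# `channels` and "channels.md" are placeholders for my real data.
channels = [
    {"title": "~~~", "description": "~~~~", "youtubeURL": "~~~~",
     "subscriberCount": 220000, "tags": ["termA", "termB", "termC"]},
    # ... roughly 20,000 records in total
]

lines = [
    "| title | description | YouTube URL | subscribers | tags |",
    "| --- | --- | --- | --- | --- |",
]
for ch in channels:
    lines.append("| {} | {} | {} | {} | {} |".format(
        ch["title"], ch["description"], ch["youtubeURL"],
        ch["subscriberCount"], ", ".join(ch["tags"]),
    ))

with open("channels.md", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))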

Supported files

Oh, JSON is now available for retrieval? It used to be available only for the code interpreter.

Then do I need to structure the JSON like the following?

{
  "data": [
    {
      "title": "~~~",
      "description": "~~~~",
      "youtubeURL": "~~~~",
      "subscriberCount": 220000,
      "tags": ["termA", "termB", "termC", ...]
    },
    ...
  ]
}
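If that’s right, I’d generate the files with something like this (a sketch only; `channels` stands in for the records I already have, and I’d keep the same 15-way split I use for the text files):

import json

# Sketch: write the channel records as JSON files for retrieval.
# `channels` is a placeholder for the ~20,000 records already in memory.
channels = [
    {"title": "~~~", "description": "~~~~", "youtubeURL": "~~~~",
     "subscriberCount": 220000, "tags": ["termA", "termB", "termC"]},
]

# Keep the same 15-way split as the current text files.
chunk_size = max(1, len(channels) // 15 + 1)
for i in range(0, len(channels), chunk_size):
    path = f"channels_{i // chunk_size}.json"
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Korean text readable in the file.
        json.dump({"data": channels[i:i + chunk_size]}, f, ensure_ascii=False, indent=2)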

You are on the right path.

It all comes down to not just how fast you want the search done but how much flexibility you want with the document / format you have in mind.

There are wins in every column and even the simplest solutions can pay off a lot.

For example, for a larger PDF document that I don’t want to mess with, I make sure to use the OCR capabilities in Adobe and then compress it. The only work there is checking that OCR didn’t get anything wrong, so it’s a fast proofread and done.

For your JSON solution, just take the extra time to ensure it’s valid JSON. I’ll usually run it through a quick online validator and formatter. The reason is that there are times I ask the AI to redo the JSON for me, and I want to make sure I don’t trip it up.
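If you’re generating the files with a script anyway, the same check is only a few lines (a sketch; the file names are whatever you end up producing):

import json

# Validate each file before uploading; report where the JSON breaks.
def validate_json_file(path: str) -> bool:
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
        return True
    except json.JSONDecodeError as err:
        print(f"{path}: invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return False

for path in ["channels_0.json", "channels_1.json"]:
    validate_json_file(path)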

Basically, help the AI help you.

I wish you good luck as you experiment with it.


When I asked ChatGPT itself, it told me that text files were better than Excel files. Converting Excel files into structured text is obviously doable but a giant waste of time. Does anyone know if that’s true re: Excel / CSV?
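For what it’s worth, the conversion itself is only a few lines with pandas, so the time cost may be smaller than it looks (a sketch; the file, sheet, and output names are placeholders, and it assumes openpyxl is installed for .xlsx):

import pandas as pd

# Sketch: flatten an Excel sheet into JSON records (or CSV) for retrieval.
# "channels.xlsx" and the output names are placeholders.
df = pd.read_excel("channels.xlsx", sheet_name=0)  # needs openpyxl for .xlsx

# One JSON object per row; force_ascii=False keeps non-ASCII text readable.
df.to_json("channels.json", orient="records", force_ascii=False, indent=2)

# Or plain CSV, if that turns out to retrieve better.
df.to_csv("channels.csv", index=False)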


Any update here? @jinho.yoo, did you find the right format?

I’m working on something similar, trying to extract 10-15 similar/suitable items from the file for a user’s query.

When I try to attach .txt or .md files, I get the following error:
application/octet-stream error_code: unhandled_mimetype
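One thing I’m going to try is passing an explicit filename with the upload, on the guess that the mimetype check fails because the extension gets lost somewhere (a sketch with the openai Python client; that this is the actual cause is only my assumption):

from openai import OpenAI

client = OpenAI()

# Guess: unhandled_mimetype may mean the upload arrives without a usable
# filename/extension and so defaults to application/octet-stream.
# Passing an explicit (filename, bytes) tuple keeps ".md" attached.
with open("channels.md", "rb") as f:
    uploaded = client.files.create(
        file=("channels.md", f.read()),
        purpose="assistants",
    )
print(uploaded.id)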

What effect you want to achieve? You don’t want to run code interpreter, but want Assistant to conduct data analysis? Or only find relevant records? We have to remember that there’s RAG underneath - and we don’t know details - how text is splitted, how it is searched. In case of such a data I would assume records could be to close to each other in semantic space and output not relevant nor predictable.