What's the best file format for recommendations using the Assistants API?

I’ve generated multiple text files containing information about YouTube channels. Specifically, I’ve created 15 files covering nearly 20,000 YouTube channels, each file containing details like titles, descriptions, subscriber counts, view counts, and other information.

However, when I tried to use the Assistants API for retrieval, the process took close to 5 minutes and occasionally timed out. The responses were also sometimes unacceptable. I’m wondering whether the TXT file format might not be suitable for this kind of retrieval.

Here’s an example of how I structured the data:

- title: ~~~
- description: ~~~~
- Youtube URL: ~~~~
- subscriber count: 220000
- tags: ......
....

- title: ~~~
- description: ~~~~
- Youtube URL: ~~~~
- subscriber count: 220000
- tags: ......
....

- title: ~~~
- description: ~~~~
- Youtube URL: ~~~~
- subscriber count: 220000
- tags: ......
....

This structure makes the information easy to retrieve, but given the performance issues and occasionally unsatisfactory responses, I’m exploring alternative formats or approaches to better handle this volume of data with the Assistants API.

What would be better? I’m using Korean, not English, so all of the documents are in Korean.


I had the AI itself run timing simulations on various file formats and was shocked that the fastest solution for me was a PDF with renderable text.

Perhaps consider a table of contents to map out your document. You could also put it into Markdown format to make it easier to parse. In the end it’s all about helping it help you.

For me, JSON has always been the best-performing format. It gives the model some degree of consistency.


Oh, table formatted data in Markdown can be a good candidate. I’ll try it.
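Probably something like this to start (just a rough sketch; the channels list and output file name are placeholders for my real data, and real descriptions would need any "|" characters escaped):

# Rough sketch: dump channel records into one Markdown table per file.
# `channels` and "channels.md" are placeholders for my real data.
channels = [
    {"title": "~~~", "description": "~~~~", "youtubeURL": "~~~~",
     "subscriberCount": 220000, "tags": ["termA", "termB", "termC"]},
    # ... roughly 20,000 records in total
]

lines = [
    "| title | description | YouTube URL | subscribers | tags |",
    "| --- | --- | --- | --- | --- |",
]
for ch in channels:
    lines.append("| {} | {} | {} | {} | {} |".format(
        ch["title"], ch["description"], ch["youtubeURL"],
        ch["subscriberCount"], ", ".join(ch["tags"]),
    ))

with open("channels.md", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))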

Supported files

Oh, JSON is now available for retrieval? It used to be available only for the code interpreter.

Then do I need to structure the JSON like the following?

{
  "data": [
    {
      "title": "~~~",
      "description": "~~~~",
      "youtubeURL": "~~~~",
      "subscriberCount": 220000,
      "tags": ["termA", "termB", "termC", ...]
    },
    ...
  ]
}
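If that’s right, I’d generate the files with something like this (a sketch only; `channels` stands in for the records I already have, and I’d keep the same 15-way split I use for the text files):

import json

# Sketch: write the channel records as JSON files for retrieval.
# `channels` is a placeholder for the ~20,000 records already in memory.
channels = [
    {"title": "~~~", "description": "~~~~", "youtubeURL": "~~~~",
     "subscriberCount": 220000, "tags": ["termA", "termB", "termC"]},
]

# Keep the same 15-way split as the current text files.
chunk_size = max(1, len(channels) // 15 + 1)
for i in range(0, len(channels), chunk_size):
    path = f"channels_{i // chunk_size}.json"
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Korean text readable in the file.
        json.dump({"data": channels[i:i + chunk_size]}, f, ensure_ascii=False, indent=2)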

You are on the right path.

It all comes down to not just how fast you want the search done but how much flexibility you want with the document / format you have in mind.

There are wins in every column and even the simplest solutions can pay off a lot.

For example, for a larger PDF document that I don’t want to mess with, I make sure to use the OCR capabilities in Adobe and then compress it. The only work there is checking that OCR didn’t get anything wrong, so it’s a fast proofread and done.

For your JSON solution, just take the extra time to ensure it’s valid JSON. I’ll usually run it through a quick online validator and formatter. The reason is that there are times I ask the AI to redo the JSON for me, and I want to make sure I don’t trip it up.
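If you’re generating the files with a script anyway, the same check is only a few lines (a sketch; the file names are whatever you end up producing):

import json

# Validate each file before uploading; report where the JSON breaks.
def validate_json_file(path: str) -> bool:
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
        return True
    except json.JSONDecodeError as err:
        print(f"{path}: invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return False

for path in ["channels_0.json", "channels_1.json"]:
    validate_json_file(path)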

Basically, help the AI help you.

I wish you good luck as you experiment with it.


When I asked ChatGPT itself, it told me that text files were better than Excel files. Converting Excel files into structured text is obviously doable but a giant waste of time. Does anyone know if that’s true re: Excel / CSV?
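For what it’s worth, the conversion itself is only a few lines with pandas, so the time cost may be smaller than it looks (a sketch; the file, sheet, and output names are placeholders, and it assumes openpyxl is installed for .xlsx):

import pandas as pd

# Sketch: flatten an Excel sheet into JSON records (or CSV) for retrieval.
# "channels.xlsx" and the output names are placeholders.
df = pd.read_excel("channels.xlsx", sheet_name=0)  # needs openpyxl for .xlsx

# One JSON object per row; force_ascii=False keeps non-ASCII text readable.
df.to_json("channels.json", orient="records", force_ascii=False, indent=2)

# Or plain CSV, if that turns out to retrieve better.
df.to_csv("channels.csv", index=False)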


Any update here? @jinho.yoo, did you find the right format?

I’m working on something similar, trying to extract 10-15 similar/suitable items from the file for a user’s query.

When I try to attach .txt or .md files, I get the following error:
application/octet-stream error_code: unhandled_mimetype
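One thing I’m going to try is passing an explicit filename with the upload, on the guess that the mimetype check fails because the extension gets lost somewhere (a sketch with the openai Python client; that this is the actual cause is only my assumption):

from openai import OpenAI

client = OpenAI()

# Guess: unhandled_mimetype may mean the upload arrives without a usable
# filename/extension and so defaults to application/octet-stream.
# Passing an explicit (filename, bytes) tuple keeps ".md" attached.
with open("channels.md", "rb") as f:
    uploaded = client.files.create(
        file=("channels.md", f.read()),
        purpose="assistants",
    )
print(uploaded.id)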

What effect you want to achieve? You don’t want to run code interpreter, but want Assistant to conduct data analysis? Or only find relevant records? We have to remember that there’s RAG underneath - and we don’t know details - how text is splitted, how it is searched. In case of such a data I would assume records could be to close to each other in semantic space and output not relevant nor predictable.