Playground Assistants: what is the best file type?

thibaut · November 7, 2023, 4:13pm

Hello,

What is the best type of file to provide to a Playground Assistant for an ecommerce site?

Is it better to provide a CSV file with separators, a JSON file or something else?

My file will contain product names, descriptions, prices, references and characteristics.

Thanks in advance for your help

Thibaut

Malte0621 · November 7, 2023, 4:33pm

Considering that OpenAI recently announced JSON mode and other parsing improvements to JSON, I think JSON would be more reliable but I’m sure it can also handle CSV files if necessary.

_j · November 7, 2023, 4:35pm

If you have two-dimensional data, the format best understood by the AI will likely be Markdown tables. That is what ChatGPT outputs to produce rendered tables in the UI. It is quite versatile, though.

thibaut · November 9, 2023, 2:47pm

Hello.

Thanks for your reply.

Markdown seems to be a good solution.

mattrosine · November 9, 2023, 3:04pm

When I try to upload a .csv to the playground assistant I get this error:

Failed to create assistant: UserError: Failed to index file: Unsupported file file-tYMNCZxn29GoTc9xtg4kAu6D type: application/csv

Even though it is a correct .csv and the data is prepared properly.

If you want to use the upload file feature, the ONLY file format it accepts is json. So how would I use markdown tables?

_j · November 9, 2023, 4:28pm

Markdown is a text format. It would be a normal document with information that can be put into the AI context for understanding.

For example, AI makes such a markdown table for me, markdown also being the format method for this forum:

| Type | Description | Pros | Cons | Note |
| --- | --- | --- | --- | --- |
| Fine-Tuning | This is a process where a pre-trained model is further trained on a specific task. The model is “fine-tuned” to adapt its knowledge to the new task. | Allows the model to specialize in a particular domain, improving its performance on that task. | Requires a lot of computational resources and time. Overfitting can occur if not properly managed. | It’s a common practice in deep learning. |
| System-Level Prompts | This involves providing a permanent prompt at the start of every conversation, giving the AI a context or a role to play. | Helps to set the context and guide the conversation. | Can limit the flexibility of the AI in handling diverse conversations. | It’s a simple but effective way to guide AI behavior. |
| Retrieval-Augmented Generation (RAG) | RAG combines a retrieval-based AI model with a generative model. The model retrieves information from a database and then generates a response based on the retrieved information. | Can handle a wide range of queries. Allows the model to access a large amount of information. | Requires a large and diverse database. The quality of the generated response depends on the quality of the retrieved information. | It’s a state-of-the-art technique in AI. |
| Function Calling | This involves an AI model that can call functions to directly access data. The model can browse and retrieve information from a database. | Provides the model with direct access to a large amount of data. | Requires a well-structured and organized database. The model must be capable of understanding and executing the functions. | It’s a more advanced technique that requires sophisticated AI models. |

Of course if your CSV data is massive, chunking into parts of tables could be bot-confusing, just as it would have no way of understanding other data formats larger than its available context window.
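
To make the "CSV as Markdown" idea concrete, here is a minimal sketch of converting CSV text into Markdown tables with only the Python standard library. The function names, the sample rows, and the choice to repeat the header in every chunk are my own assumptions for illustration, not anything OpenAI prescribes; repeating the header is just one way to keep each chunk readable on its own.

```python
import csv
import io


def render_table(header, body):
    """Render one Markdown table: header row, separator row, then data rows."""

    def fmt(cells):
        # Escape pipes so cell content cannot break the table layout.
        return "| " + " | ".join(c.strip().replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "|" + "|".join(" --- " for _ in header) + "|"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)


def csv_to_markdown_chunks(csv_text, rows_per_chunk):
    """Split CSV text into Markdown tables of at most rows_per_chunk data rows.

    The header is repeated in every chunk (an assumption of this sketch).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [
        render_table(header, body[i:i + rows_per_chunk])
        for i in range(0, len(body), rows_per_chunk)
    ]


# Hypothetical sample data, only to show the output shape.
sample = "name,reference,price\nRunning shoe,REF-001,99\nTrail shoe,REF-002,120\n"
for chunk in csv_to_markdown_chunks(sample, rows_per_chunk=1):
    print(chunk)
    print()
```

The result is plain text, so it can be saved as a `.md` or `.txt` file and uploaded like any other document.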

thibaut · November 9, 2023, 5:02pm

I’ve already had this problem.

I had a Python script that created a CSV.
When I sent this CSV to OpenAI, I got the same error.

Solution: open the csv with Numbers (or similar), then export it again as a CSV.

I hope this helps.

EricGT · November 9, 2023, 6:44pm

Just trying to understand what that means.

Can you flesh out the details of that statement?

N2U · November 9, 2023, 6:51pm

Both CSV and JSON can be useful; which to choose depends entirely on the structure of your data. If your records are all very similar in structure, a CSV file works well, but if the structure varies a lot from record to record, JSON is the better fit.

My best advice is to try both and see what works for you.

DustinGood · November 9, 2023, 9:16pm

So am I an idiot for uploading PDFs?

thibaut · November 10, 2023, 11:58pm

Thanks for your reply.

I tested CSV and JSON.
After a lot of prompt modification, I managed to get pretty good results with JSON.

I’m still going to try Markdown and will let you know what I come up with.

RonaldGRuckus · November 11, 2023, 12:02am

I am wondering what success people have found using the retrieval for JSON?

Is there a separate mechanism for it or does it perform the typical chunking / embedding using ada?

Typically JSON follows a schema, which means it can be indexed and gathered using something like function calling (using amazing stuff like GraphQL). At the very least keyword search would be more potent

I did try JSON myself (not a lot) out of curiosity and found that it failed most of the time and said ( who knows if true) that it’s just gonna open the damn file itself lol.

So, it did eventually get the answer, but I incurred A LOT of context token fees. I converted a PDF datasheet to JSON and segmented it. Tried the 3,000 line JSON one with some very easy queries.

It could do single queries (What’s the size of Product X). Failed with doubles (What’s the sizes of Product X and Product Y). Which dammit this is why pre-processing the query is so damn potent and now I can’t!!!

Hypothesis: Markdown is best for retrieval. Maybe?

thibaut · November 11, 2023, 12:13am

For the JSON, here is a prompt that I had to insert (it is not yet perfect) to bring out the correct information.

But I can’t wait to try Markdown!

Here is the method to follow for your research:

1. Initial Identification of Key Terms: When receiving a customer request, I will determine and note key terms in the request, whether they be product descriptors, references, or specific attributes highlighted by the customer.
2. Targeted Analysis of the Catalog: With the key terms identified, I open and browse the detailed product catalogs, searching precisely for the unique references and identifiers corresponding to the key terms.
3. Verification and Cross-checking: Finding matching products, I carefully compare them to the customer’s expressed needs, including matching descriptions, specifications, and compatibility of associated products.
4. Iterative and Refined Search: If the initial results are inconclusive, I re-evaluate the key terms, and engage in additional searches through secondary keywords to identify the requested product or confirm its absence in the catalog.
5. Validation and Accuracy of Information: Before presenting the product to the customer, I check that all the information corresponds exactly to expectations by checking the technical details of the product. I also certify the relevance of the application method for the selected product.

thibaut · November 11, 2023, 12:20am

For markdown, do you think it is better to present a file in table form?
Ex.
|id_product|name|reference|description_short|description|price|category_name|Manufacturer|features|axtags|
|-|-|-|-|-|-|-|-|-|-|

Or “standard”?

Name of product

ID: 1234
Brand: Nike
Description: lorem ipsum
Price: $99
Category: Shoes
Size: A, B, C

It may be more complicated to manage as standard because the product has variations with different attributes/prices?
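
For comparing the two layouts side by side, here is a rough sketch that renders the same product records both ways. The record structure, the `variants` field, and the per-size prices are hypothetical and only loosely based on the example above. In the table layout each variation gets its own row; in the "standard" layout the variations become a nested list under the product.

```python
# Hypothetical catalog entry; the per-size prices are invented for illustration.
PRODUCTS = [
    {
        "id": 1234,
        "name": "Name of product",
        "brand": "Nike",
        "category": "Shoes",
        "variants": [
            {"size": "A", "price": "$99"},
            {"size": "B", "price": "$109"},
            {"size": "C", "price": "$119"},
        ],
    },
]


def to_table(products):
    """Table layout: one row per variant, so each size/price pair stays together."""
    header = ["id", "name", "brand", "category", "size", "price"]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(" --- " for _ in header) + "|",
    ]
    for p in products:
        for v in p["variants"]:
            cells = [str(p["id"]), p["name"], p["brand"], p["category"], v["size"], v["price"]]
            lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


def to_sections(products):
    """'Standard' layout: one heading per product, variants as a nested list."""
    blocks = []
    for p in products:
        lines = [
            f"## {p['name']}",
            "",
            f"- ID: {p['id']}",
            f"- Brand: {p['brand']}",
            f"- Category: {p['category']}",
            "- Variants:",
        ]
        lines += [f"  - Size {v['size']}: {v['price']}" for v in p["variants"]]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


print(to_table(PRODUCTS))
print()
print(to_sections(PRODUCTS))
```

Generating both files from the same data makes it easy to upload each one to a separate assistant and compare the answers with an identical prompt.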

RonaldGRuckus · November 11, 2023, 3:43am

I think it’s worth embedding two similar(ish) tables using the ada-embedding model and seeing how they perform. I don’t think you want to add any notations (is that the right word?)

It may also be worth thinking about instead using something like an API to retrieve these products by name. You can do a fuzzy match (there’s a distance algorithm that escapes me right now) for top N.
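
A minimal sketch of that kind of top-N lookup by name, using only the standard library. Note that `difflib.get_close_matches` ranks candidates by a similarity ratio, not by an edit distance, so it is a stand-in for the algorithm I couldn't name. The catalog contents and the `find_products` helper are hypothetical.

```python
import difflib

# Hypothetical catalog keyed by product name.
CATALOG = {
    "Running Shoe": {"reference": "REF-001", "price": "$99"},
    "Trail Shoe": {"reference": "REF-002", "price": "$120"},
    "Wool Sock": {"reference": "REF-003", "price": "$12"},
}


def find_products(query, catalog, top_n=3, cutoff=0.6):
    """Return up to top_n (name, record) pairs whose names resemble the query."""
    # Compare in lowercase so "running shoe" and "Running Shoe" are treated alike.
    by_lower = {name.lower(): name for name in catalog}
    hits = difflib.get_close_matches(query.lower(), list(by_lower), n=top_n, cutoff=cutoff)
    return [(by_lower[h], catalog[by_lower[h]]) for h in hits]


for name, record in find_products("runing shoe", CATALOG):
    print(name, record)
```

A function like this could sit behind an API endpoint or be exposed to the model through function calling, so the model asks for a product by name instead of searching an uploaded file.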

If your dataset conforms to a schema then retrieval may not be the best option.

When I was doing keyword (products for example) search I found a lot of success with a fuzzy match ALONG with a metaphone? I think it’s called?

N2U · November 11, 2023, 10:16am

I think you’re thinking of Levenshtein distance (the number of single-character modifications required to transform one string into the other).

It’s super great if your database has multiple similar entries you want to match to the same entry, like “John doe” and “John Doe” (difference is the D)
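
For reference, a small sketch of the distance itself, written as the usual dynamic-programming version that keeps only one row of the table at a time. The function name is my own; in practice you may prefer an existing library.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


print(levenshtein("John doe", "John Doe"))  # 1
print(levenshtein("kitten", "sitting"))     # 3
```

To get a top-N match, compute the distance from the query to every candidate and keep the N smallest values.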

Me too! But it depends on the type of data. I thought I could do the same for finding keywords in the transcripts from the Dev-Day keynote for the Bingo event thing.

Oh boy was I wrong, and I had a lot of false positives. Luckily @curt.kennedy came to the rescue with an alias function to replace the fuzzy matching one, and that worked perfectly.

Thank you for sharing your findings with us! I’m very interested in seeing where this goes!

No, and welcome to the community, by the way. PDFs are listed as a supported file type, so there’s nothing wrong with trying, but I would recommend that you try asking GPT to convert your PDFs into Markdown and see what comes out at the other end. That will give you an idea of how successful GPT is at reading them.

thibaut · November 11, 2023, 8:39pm

Hello

After trying out several different assistants with the same prompt and different file types (JSON, Markdown, etc.), the best results seem to be obtained with Markdown.

I haven’t tested Txt files, but I think it’s the same.

I hope this helps some of you

EricGT · December 17, 2023, 12:10pm

As this topic has an accepted solution, closing topic.
