I’m trying to use the OpenAI APIs to search a relatively big CSV file. The file contains products and is structured like this:
id, name, color, material, price
and some sample data:
1, teddy bear, brown, polyester, 10
2, panda, black and white, cotton, 20
3, giraffe, yellow, plush, 30
I want my users to be able to search for “the most expensive item”, “brown bear made out of synthetic material”, “any toy that is not cotton”, “biggest toy”, “toys for 3-year-old” or “toys for boys”.
Traditional tabular/document storage can’t answer these queries on its own, because they depend on real-world knowledge that isn’t in the data, e.g. gender bias on toys.
I tried the ada embedding model, but something like “most expensive” or “not cotton” doesn’t work with it, since vector similarity doesn’t capture superlatives or negation.
I also tried using Completion. It works with a small sample set, but I need to provide the whole list every time, which is not practical given the token limits and the price.
I tried fine-tuning davinci, providing the description of the product as the prompt and the id as the result, but I received gibberish when I tried to use it, e.g. 1818 product category12 - 433065 - The178055 - 4596528622468 and so on … none of these numbers exist in my dataset.
I don’t think you can use SQL to query for something like “Toys for a 3-year-old boy” and expect it to return a toy car instead of a Barbie (gender bias), or for something like “Super heroes” and expect it to return a Batman figure.
That’s exactly the purpose of a relational database.
You can easily extract important information using simple logic.
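To make that concrete, here is a minimal sketch using the three sample rows from the question, loaded into an in-memory SQLite database. The table name `products` and the helper names are my own assumptions, not anything from the original CSV. It only covers the queries that map directly onto columns (“most expensive”, “not cotton”).

```python
import sqlite3

# Sample rows copied from the question: (id, name, color, material, price).
ROWS = [
    (1, "teddy bear", "brown", "polyester", 10),
    (2, "panda", "black and white", "cotton", 20),
    (3, "giraffe", "yellow", "plush", 30),
]

def build_db():
    """Create an in-memory database holding the sample products."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products ("
        "id INTEGER PRIMARY KEY, name TEXT, color TEXT, material TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn

def most_expensive(conn):
    """'The most expensive item' becomes a sort on price."""
    row = conn.execute(
        "SELECT name FROM products ORDER BY price DESC LIMIT 1"
    ).fetchone()
    return row[0]

def not_material(conn, material):
    """'Any toy that is not cotton' becomes a plain inequality filter."""
    rows = conn.execute(
        "SELECT name FROM products WHERE material <> ? ORDER BY id", (material,)
    ).fetchall()
    return [r[0] for r in rows]

conn = build_db()
print(most_expensive(conn))          # giraffe
print(not_material(conn, "cotton"))  # ['teddy bear', 'giraffe']
```

Queries like “biggest toy” would need a size column that the sample data doesn’t have, so they are left out here.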
You can also use entity extraction to convert a sentence into a database query. Keep in mind that this method won’t be perfect and will need continual training. I imagine you could tie the two together for further training/testing.
“Toys for 3-year old boy” → Extract(Age, Gender, Etc.) → select_from_database(age=3, gender=boy)
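The pipeline above could be sketched roughly like this. Everything here is an assumption for illustration: the `toys` table, its `min_age`/`max_age`/`gender` columns and rows are made up, and the regex-based `extract` is only a crude stand-in for a real, trained entity extractor.

```python
import re
import sqlite3

# Hypothetical rows, invented for illustration: (name, min_age, max_age, gender).
TOYS = [
    ("toy car", 3, 8, "boy"),
    ("doll", 3, 10, "girl"),
    ("building blocks", 1, 5, "any"),
    ("chemistry set", 10, 16, "any"),
]

def build_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE toys (name TEXT, min_age INTEGER, max_age INTEGER, gender TEXT)"
    )
    conn.executemany("INSERT INTO toys VALUES (?, ?, ?, ?)", TOYS)
    return conn

def extract(query):
    """Stand-in for entity extraction: pull an age and a gender out of the text."""
    q = query.lower()
    age = re.search(r"(\d+)\s*-?\s*year", q)
    gender = re.search(r"\b(boy|girl)s?\b", q)
    return {
        "age": int(age.group(1)) if age else None,
        "gender": gender.group(1) if gender else None,
    }

def select_from_database(conn, age=None, gender=None):
    """Build a parameterized WHERE clause from whichever entities were found."""
    clauses, params = [], []
    if age is not None:
        clauses.append("min_age <= ? AND max_age >= ?")
        params += [age, age]
    if gender is not None:
        clauses.append("gender IN (?, 'any')")
        params.append(gender)
    where = " AND ".join(clauses) if clauses else "1"
    sql = "SELECT name FROM toys WHERE " + where + " ORDER BY name"
    return [r[0] for r in conn.execute(sql, params)]

conn = build_db()
entities = extract("Toys for 3-year old boy")
print(entities)                                  # {'age': 3, 'gender': 'boy'}
print(select_from_database(conn, **entities))    # ['building blocks', 'toy car']
```

The extraction step is the part that would need the continual training mentioned above; the database side stays an ordinary parameterized query.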
As sps says (Happy birthday!), it’s a matter of structuring your database and creating a pipeline to manage each separate function.
Here’s a thought for your situation:
Instead of querying GPT as a database, why not use GPT to create tags for each product? You can then store the tags in your database and also run some nice analytics on them. I asked ChatGPT and this was its answer:
Create tags that relate the following item to an item store.
Item: Batman action figure
Tags:
[RESP] Batman, action figure, superhero, DC Comics, merchandise, collectibles, toy store, comic book store, pop culture.
For some reason my line separator doesn’t appear here. Another great aspect for this idea is that you would only ever need to query GPT once for each item, instead of each time someone searches something.
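A rough sketch of the “tag once, search many times” idea: `generate_tags` below is a hypothetical placeholder for the one-time GPT call per item (here it just returns the tags from the answer above), and the matching in `search` is a deliberately crude substring check of my own, not a recommended ranking method.

```python
import re
from collections import defaultdict

def generate_tags(item_name):
    """Hypothetical placeholder for a one-time GPT call per item.

    Returns the canned tags quoted in the post instead of calling an API.
    """
    canned = {
        "Batman action figure": (
            "Batman, action figure, superhero, DC Comics, merchandise, "
            "collectibles, toy store, comic book store, pop culture"
        ),
    }
    raw = canned.get(item_name, "")
    return [t.strip() for t in raw.split(",") if t.strip()]

def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def build_index(item_names):
    """Map each normalized tag to the set of items carrying it (built once)."""
    index = defaultdict(set)
    for name in item_names:
        for tag in generate_tags(name):
            index[normalize(tag)].add(name)
    return index

def search(index, query):
    """Return items whose tags occur inside the normalized query text."""
    q = normalize(query)
    hits = set()
    for tag, items in index.items():
        if tag in q:
            hits |= items
    return sorted(hits)

index = build_index(["Batman action figure"])
print(search(index, "Super heroes"))  # ['Batman action figure']
print(search(index, "giraffe"))       # []
```

The index is built once from the stored tags, so no GPT call happens at search time.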
Extracting tags using GPT is a great idea. Combined with some local filters to remove repetitive tags and reject unwanted ones, it’s a very useful way to improve the search cheaply.
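The local filtering could look something like the sketch below. The function name `clean_tags` and the idea of a hand-maintained reject list are my own assumptions about what such a filter might be.

```python
def clean_tags(raw_tags, rejected=frozenset()):
    """Normalize tags, drop duplicates, and reject unwanted ones locally.

    `rejected` is a hypothetical hand-maintained set of tags to discard.
    Order of first appearance is preserved.
    """
    seen = set()
    result = []
    for tag in raw_tags:
        # Lowercase and collapse any run of whitespace to a single space.
        norm = " ".join(tag.lower().split())
        if not norm or norm in seen or norm in rejected:
            continue
        seen.add(norm)
        result.append(norm)
    return result

raw = ["Batman", "batman ", "Toy Store", "merchandise", "toy  store"]
print(clean_tags(raw, rejected={"merchandise"}))  # ['batman', 'toy store']
```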
It seems that my idea of searching a large set of unstructured data is not achievable with the current APIs, at least not cheaply.