Finetuning model to learn from my product data

Hi,

For days now I've been trying to send a CSV file with my product data to fine-tune a model.
My goal is to send the CSV via cURL to the API to fine-tune the model text-ada-001 (just for a simple search function).

If I use the endpoint ‘https://api.openai.com/v1/files’, I get an error message: Additional properties are not allowed (‘model’, ‘suffix’ were unexpected).
But as I understand it, I need to declare which model I want to fine-tune, and I would also like to give the fine-tuned model a different name ($suffix), so that I can use it under that name.

I also tried the endpoint ‘https://api.openai.com/v1/models/text-ada-001/versions/1/train’, but then I get an error that this URL is wrong.

Here is my PHP function:

function upload_myfile($file_path, $api_key, $suffix)
{
	$api_endpoint = 'https://api.openai.com/v1/files';
	// $api_endpoint = 'https://api.openai.com/v1/models/text-ada-001/versions/1/train';

	$file = new CURLFile($file_path);

	$data = array(
		'file' => $file,
		'model' => 'text-ada-001',
		'suffix' => $suffix
	);

	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $api_endpoint);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_POST, true);
	curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
	curl_setopt($ch, CURLOPT_HTTPHEADER, array(
		'Content-Type: multipart/form-data',
		'Authorization: Bearer ' . $api_key
	));

	$response = curl_exec($ch);
	curl_close($ch);

	return $response;
}

Hope somebody can help. :slight_smile:

Looking at the docs here (https://beta.openai.com/docs/api-reference/files/upload?lang=curl), here is what your request body should contain:

curl https://api.openai.com/v1/files \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F purpose="fine-tune" \
  -F file='@mydata.jsonl'

that’s why ‘model’ and ‘suffix’ are unexpected.

You can define the model and suffix when creating the fine-tune, after you’ve uploaded your file (https://beta.openai.com/docs/api-reference/fine-tunes/create?lang=curl):

curl https://api.openai.com/v1/fine-tunes \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
  "training_file": "file-XGinujblHPwGLSztz8cPS8XY"
}'
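The same two-step flow can be sketched in Python (just a sketch, assuming the `requests` library; note that the legacy /v1/fine-tunes endpoint historically expected base model names like "ada" rather than "text-ada-001"):

```python
import json

API_BASE = "https://api.openai.com/v1"

def build_upload_request(file_path, api_key):
    """Step 1: upload the training file. Only `purpose` and `file` are
    accepted here; `model` and `suffix` belong to step 2."""
    return {
        "url": API_BASE + "/files",
        "headers": {"Authorization": "Bearer " + api_key},
        "data": {"purpose": "fine-tune"},
        "files": {"file": open(file_path, "rb")},
    }

def build_finetune_request(training_file_id, api_key, suffix=None):
    """Step 2: create the fine-tune, referencing the uploaded file by its ID."""
    body = {"training_file": training_file_id, "model": "ada"}
    if suffix:
        body["suffix"] = suffix
    return {
        "url": API_BASE + "/fine-tunes",
        "headers": {
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        "data": json.dumps(body),
    }

# Actual calls (need a valid key and the `requests` library):
# import requests
# r1 = requests.post(**build_upload_request("mydata.jsonl", key))
# file_id = r1.json()["id"]
# r2 = requests.post(**build_finetune_request(file_id, key, "my-products"))
```

The point is the split: the file upload only carries `purpose`, and the file ID from its response is what ties the file to the fine-tune request, where `model` and `suffix` live.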

I think I understand: the ID in the upload response is the file key that links the uploaded file to the model I want to fine-tune.

OK, I’ll try it out, thanks for your help. :+1:

cooper64


I guess I need some help. I’ve been going in circles for days with no results.

I’m trying to use the Ada model as a simple search engine for my product data.
The documentation tells me embeddings are a good way to do that.
The documentation also says it’s better to fine-tune a model first, for better results.
So I built a file with the product data and sent it to the endpoint ‘https://api.openai.com/v1/files’.
The response shows me an error: ‘Expected file to have JSONL format with prompt/completion keys with string values.’

I sent it as a JSON object.

What really confuses me is: I only have the product data to send as ‘prompt’; I can’t define a ‘completion’ string. There is no completion to define in my case. The model should just learn the product data so it can be used later for search requests.

For example, a product as a JSON object:

{
  "product_id": 1,
  "title": "Apple iPhone 12",
  "price": 800,
  "attributes": { "color": "black", "storage": "128GB", "camera": "12MP" },
  "description": "The latest iPhone from Apple with advanced camera...",
  "search_keywords": ["iphone", "apple", "smartphone"]
}

So what should this look like as a dataset in a text file, with prompt and completion, that the API can handle?
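For illustration, the validator wants one JSON object per line with exactly those two keys. The way I've split prompt and completion below is an invented one (a product catalog has no natural completion, which is exactly the problem):

```python
import json

product = {
    "product_id": 1,
    "title": "Apple iPhone 12",
    "price": 800,
    "attributes": {"color": "black", "storage": "128GB", "camera": "12MP"},
    "description": "The latest iPhone from Apple with advanced camera...",
}

def to_jsonl_line(p):
    """Flatten one product into the prompt/completion shape the
    /v1/files validator expects. The separator and END marker follow
    the common fine-tuning conventions; the field choice is arbitrary."""
    prompt = "Product: " + p["title"] + "\n\n###\n\n"
    completion = " " + p["description"] + " END"
    return json.dumps({"prompt": prompt, "completion": completion})

line = to_jsonl_line(product)
# Each product becomes one such line in the .jsonl file.
```

Whether fine-tuning on such artificial pairs actually helps a search use case is a separate question; the block only shows the file format.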

Or maybe I’m on completely the wrong track. :thinking:

Thanks for any input…

Hi @cooper64,

What behavior do you expect?

Here is a way to build a search engine with Pinecone and OpenAI.

  1. Embed your product with OpenAI (ada-002), using the keyword that should find the product. In this case, let’s use the title: “Apple iPhone 12”.
  2. Store the result in Pinecone. You can put what you have in “attributes” into the metadata; you can even filter on it later.
  3. When a user types “iphone 12”, take the string, send it to OpenAI (ada-002) and run a query against Pinecone. You can set top_k = 3, which will return the 3 most likely results.

Using the above, you should get “Apple iPhone 12” returned with all its attributes (to do so, set include_metadata to true in the Pinecone query).
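The three steps above could be sketched like this (the OpenAI/Pinecone calls are shown only as comments, since they need API keys; `to_pinecone_vector` is a hypothetical helper that packs a product into Pinecone's (id, values, metadata) tuple format):

```python
def to_pinecone_vector(product, embedding):
    """Pack one product into the (id, values, metadata) tuple shape
    that Pinecone upserts accept. Attributes become filterable metadata."""
    metadata = dict(product.get("attributes", {}))
    metadata["title"] = product["title"]
    return (str(product["product_id"]), embedding, metadata)

# 1. Embed the title with text-embedding-ada-002:
#    emb = openai.Embedding.create(model="text-embedding-ada-002",
#                                  input=product["title"])["data"][0]["embedding"]
# 2. Store it, with the attributes as metadata:
#    index.upsert(vectors=[to_pinecone_vector(product, emb)])
# 3. At query time, embed the user's search string the same way, then:
#    index.query(vector=query_emb, top_k=3, include_metadata=True)
```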

I’m just trying to use OpenAI to integrate an intelligent product search into our online shop.
My last attempt was to get usable results via embeddings. So I generated embeddings from my product data and also from the search term, and tried to compare them with a cosineSimilarity() function. But the results were beyond bad. Maybe it’s because of the model used, maybe the German language in the data is a problem, or maybe I need to fine-tune the model first.

The attempt to fine-tune a model had previously failed because the API did not accept the data format in my uploaded text file. I was confused that a dataset requires a prompt/completion pair. I just have my product data converted to a JSON string, one line in the text file for every product.

I didn’t know Pinecone before, maybe that’s a better approach for me. I’ll take a look.

I’m just trying to use OpenAI to integrate an intelligent product search into our online shop.

And I’m telling you a way to do it.

Do whatever you want to do; I’m suggesting a way to do it, and for me it’s working well and the results are great.

Thanks for the support. I’ll try it out, and I can see in the documentation that I have a lot to learn about this.
I only understand half of it… :wink:

Sorry, I was rude! haha

But I really think part of it comes from experimentation too. Don’t hesitate to use Pinecone.io; it’s free and will save you the hassle of doing the cosine function yourself.

no, no, it’s all fine :slight_smile:

So, in the meantime I’ve learned a lot about Pinecone, and it was no problem to create an index, upload around 100 product datasets (including embeddings) for testing, write a script to generate the embedding of a search string (via Ada) and send it via the API to Pinecone, and I got some results back. So everything works as expected.

But the result is the same bad result as before, when I compared the search string/product embeddings via my PHP cosineSimilarity() function.
And I’m now really sure that the problem is searching in the embedding data, especially with the search string embedding. (That’s because I previously compared the embeddings of each product to each other, and similar products were sorted very close to each other.)

For example, here is what I got back from my search:
Search string: ‘LED lamp flat design’
Result Pos. 1: LED 1m stripe red IP63 with power supply
Result Pos. 2: LED power supply 12 volts 100 watts waterproof
Result Pos. 3: Socket strip 6x splash-proof IP44
Result Pos. 4: Tape roll 9 mm x 50 m silicone paper brown

At Pos. 9 I got a match: LED lamp DIMMABLE flat design 5 watts

You see, the resulting products are completely different, and mostly they don’t contain even a single word from the search string.

So I’m really confused about this result, but I’d really like to solve it; I don’t want to give up! :wink:
By the way, I’m using ‘text-embedding-ada-002’ to build the embeddings of the products and the search string.

Maybe another important fact: Pinecone also gives me a score of around 0.81–0.87 for every product.
So if I search for an ‘LED lamp’, I would expect a score of 0 for a tape roll :rofl:
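As a sanity check for the PHP cosineSimilarity() function, here is a reference implementation in plain Python to compare against:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0, orthogonal vectors 0.0:
assert abs(cosine_similarity([1.0, 0.0], [1.0, 0.0]) - 1.0) < 1e-9
assert abs(cosine_similarity([1.0, 0.0], [0.0, 1.0])) < 1e-9
```

One caveat on expectations: in practice, text-embedding-ada-002 embeddings rarely produce similarities anywhere near 0, even for unrelated texts; scores tend to cluster roughly in the 0.7–0.9 range, so the relative ranking matters much more than the absolute value.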

cooper

Looking at the result, it’s not that bad lol

But:

  • Preprocess the data: Clean and preprocess the product descriptions to ensure they are in a consistent format. For example, you can lowercase all the descriptions and remove stop words and punctuation.

Looking at these products, could it be improved by removing “noisy” words?

1: LED 1m stripe red IP63 with power supply → led stripe ip63

Here, I don’t think a customer who is looking for a power supply expects to find an LED.

2: LED power supply 12 volts 100 watts waterproof → led 12v 100w
3: Socket strip 6x splash-proof IP44 → socket strip ip44
4: Tape roll 9 mm x 50 m silicone paper brown → tape 9mm 50m
5: LED lamp DIMMABLE flat design 5 watts → led 5w
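A minimal preprocessing sketch along those lines: lowercase, strip punctuation, drop filler words. The stop-word list here is just an example; a real one would be much longer (and German, in this case):

```python
import string

# Illustrative stop-word list only -- extend for real data.
STOP_WORDS = {"with", "and", "the", "x"}

def preprocess(title):
    """Normalize a product title for embedding: lowercase,
    remove punctuation, drop stop words."""
    text = title.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

preprocess("LED 1m stripe red IP63 with power supply")
# -> "led 1m stripe red ip63 power supply"
```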

So basically, I would try to build the input following a certain pattern:

product | size / power / weight

So let’s say you also have a “yellow 20 oz hammer with straight claw”; I would store “hammer 20oz straight claw”.

From ChatGPT:

If you are using an embedding technique such as text-embedding-ada-002 to represent products with the format product_name | characteristic1 | characteristic2, you can ensure that product_name has more weight than characteristic1 and characteristic1 has more weight than characteristic2 by concatenating the product’s attributes in a specific order, and then encoding the concatenated string.

For example:

product_name + " " + characteristic1 + " " + characteristic2

By concatenating the product’s attributes in this order, the model will place more emphasis on the product_name attribute, followed by characteristic1, and then characteristic2. When searching for a product, you can then encode the query in the same way and use cosine similarity or another relevant metric to compare the query embedding with the product embeddings stored in Pinecone.
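As a tiny sketch of that concatenation (the field names are just the ones from the quote, and whether the order really carries weight in the embedding is the quote's claim, not a verified fact):

```python
def build_embedding_input(product_name, characteristic1, characteristic2):
    """Concatenate product fields in descending order of importance,
    as suggested above, before sending the string to the embedding model."""
    return product_name + " " + characteristic1 + " " + characteristic2

build_embedding_input("hammer", "20oz", "straight claw")
# -> "hammer 20oz straight claw"
```

The same string-building would be applied to the user's query before embedding it, so both sides follow one pattern.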

and maybe in your case, something like this would be a good thing to implement: Semantic Search

so you would have two indexes, one ‘product-title’ and the other one ‘product-description’.

Hi @cooper64
Would you mind sharing a GitHub repo link of your implementation so far?
I am trying to solve a similar use case, but got stuck.
I want to build a sales chatbot that helps users find relevant items based on the product (SKU) dataset I have.

Thanks

I don’t use a GitHub repo for this little project. At the moment I’ve just written a few script snippets in PHP, mostly for the API connection via cURL.
The rest of the script is only responsible for preparing the product data so it can be sent to the OpenAI API, e.g. to get the embeddings back, or to prepare a JSON file to send the data to Pinecone.
At the moment I’m a bit stuck myself, also because I have other important tasks to do right now.
But I will continue soon…
I’ve already thought about such a project myself: an AI chatbot on our shop page. But you also have to develop a few intelligent routines so that it isn’t misused for nonsense.
After all, such a chatbot should only answer questions relevant to the context of the site, because the ‘davinci’ model in particular costs money.

cooper64