Fine-Tuning returned terrible results

I tried to fine-tune gpt-3.5, but I must have done something wrong because the results are now terrible.

My goal was to build a link classifier, so I “manually” classified a lot of links and then built this training data set. (One sample line of my JSONL is below; the full set is 500 lines similar to it.)

{"messages": [{"role": "system", "content": "You are a link classifier: given a list of links and paths from fashion brands or fashion stores, you classify each link based on the most likely content they lead to. You categorize links into one of the following categories:\n\n- Products & Categories: This section contains product names, categories, and URL paths related to various products and their categories. Examples include individual product names like \"bracelets\" and \"jackets\", as well as URLs to specific product pages and categories such as \"/products/best-sellers\", \"/collections/sale-skirts\", and \"/collections/sneakers\".\n- Designers: This category focuses on paths and links that are specifically about the designers behind the products or collections. Examples are URLs like \"/designers/chanel\", \"/collections/designer-tom-ford\", and \"/designer-spotlight\".\n- Policy Pages, FAQs, Shipping, Press & Help: This section lists URLs and strings related to policies, frequently asked questions, shipping, returns, press, and help pages of a website. It includes links to pages like \"/policies/refund-policy\", \"/pages/size-guide\", and \"/pages/faq\".\n- Contact & About: Contains URLs related to contact information and about pages. Examples are \"/contact\", \"/pages/about-us\", and \"/pages/our-story\".\n- Locations: Details URLs and strings about the physical locations or stockists of the business. 
Entries include \"/pages/stockists\" and \"/pages/our-store\".\n- Others: A more miscellaneous category, it captures a wide range of URLs from the website, ranging from cart and checkout pages like \"/cart\" and \"/checkout\", to other types of pages and actions such as \"/blog/\", \"/events\", and \"/services/store-services\".\n- You return the classified content as a dictionary, where the keys are categories and the values are lists of links classified under them.\n"}, {"role": "user", "content": "Classify these links: ['sneakers', '/collections/ceramics', '/collections/kids-footwear', '/collections/lingerie', 'new-arrivals-woman', '/collections/womens-new-arrivals', '/collections/jewelry-earrings', '/c/designers-alice-olivia-cat40360738', '/collections/pedro-garcia', '/c/designers-david-yurman-cat540734', '/collections/amber-sceats', '/collections/shaina-mote', '/c/designer-gift-shops-gucci-gift-shop-cat1696283890', '/collections/maison-louis-marie', '/shipping-FAQ', '/pages/advertising', 'terms-conditions', '/policies/cancellation-policy', '/policies/exchange-policy', '/pages/legal-notice', '/team', '/about/origin', 'real-estate', '/pages/delivery', '/rafi/list/where-to-shop-for-burning-man-in-nyc', '/selectprints/', '/delivery-choices?pr_id=promo_deliverystrip_108&pr_name=deliverystrip&pr_creative=sameday&pr_position=deliverystrip', '/account', '/services/virtual-services']"}, {"role": "assistant", "content": "{\"products & categories\": [\"sneakers\", \"/collections/ceramics\", \"/collections/kids-footwear\", \"/collections/lingerie\", \"new-arrivals-woman\", \"/collections/womens-new-arrivals\", \"/collections/jewelry-earrings\"], \"designers & collections\": [\"/c/designers-alice-olivia-cat40360738\", \"/collections/pedro-garcia\", \"/c/designers-david-yurman-cat540734\", \"/collections/amber-sceats\", \"/collections/shaina-mote\", \"/c/designer-gift-shops-gucci-gift-shop-cat1696283890\", \"/collections/maison-louis-marie\"], \"policy pages, faqs, shipping, 
careers, press & help\": [\"/shipping-FAQ\", \"/pages/advertising\", \"terms-conditions\", \"/policies/cancellation-policy\", \"/policies/exchange-policy\", \"/pages/legal-notice\"], \"contact & about\": [\"/team\", \"/about/origin\"], \"locations\": [], \"others\": [\"real-estate\", \"/pages/delivery\", \"/rafi/list/where-to-shop-for-burning-man-in-nyc\", \"/selectprints/\", \"/delivery-choices?pr_id=promo_deliverystrip_108&pr_name=deliverystrip&pr_creative=sameday&pr_position=deliverystrip\", \"/account\", \"/services/virtual-services\"]}"}]}

However, when I tried using it, it had some issues that I don’t fully understand. For example, when I pass these links:

'/p/Clz-ssiumXM/,/p/ClhindYvxM3/,/p/ClrgwZiSGiR/,/p/Clq6UD1psTE/,/p/ClwOc6-JvvW/,/p/ClpqePWr37J/,/p/Clm1UJ7JFQs/,/billystarzana/,/,/p/Clq7RgqJxV_/,/p/ClmH13yvzRb/,boutique@billystarzana.com,/p/Clgp8xAp4d0/,/10206736089920694,/,/p/ClocGzaJvPn/,/p/Cloc36Rpy6g/,/p/ClgpG33JrpT/,/p/ClmI5b2PFWm/,/,/p/Clu0gkNLQqN/,/'

I get the output below, which feels random. Granted, those are weird links, but I was expecting more consistency.

{'products & categories': ['/p/Clz-ssiumXM/',
  '/p/ClhindYvxM3/',
  '/p/ClrgwZiSGiR/',
  '/p/Clq6UD1psTE/',
  '/p/ClwOc6-JvvW/',
  '/p/ClpqePWr37J/',
  '/p/Clm1UJ7JFQs/'],
 'designers & collections': ['/billystarzana/',
  '/',
  '/p/Clq7RgqJxV_',
  '/p/ClmH13yvzRb/'],
 'policy pages, faqs, shipping, careers, press & help': [],
 'contact & about': ['boutique@billystarzana.com'],
 'locations': [],
 'others': ['/p/Clgp8xAp4d0/',
  '/10206736089920694',
  '/',
  '/p/ClocGzaJvPn/',
  '/p/Cloc36Rpy6g/',
  '/p/ClgpG33JrpT/',
  '/p/ClmI5b2PFWm/',
  '/',
  '/p/Clu0gkNLQqN/']}

Is it because I overfit? How do I debug something like that?
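
One mechanical check that may help with debugging, as a minimal sketch rather than a full evaluation: confirm that every link sent in comes back exactly once and unchanged. `audit_grouping` is a hypothetical helper name, and the sample is a shortened version of the input and output above.

```python
from collections import Counter


def audit_grouping(input_links, grouped):
    """Compare the links sent to the model with the links it returned.

    Counter is used instead of set so that repeated links (like '/')
    are counted, not collapsed.
    """
    sent = Counter(input_links)
    returned = Counter(link for links in grouped.values() for link in links)
    return {
        # Links that were sent but never came back (or came back fewer times).
        "missing": sorted((sent - returned).elements()),
        # Links the model returned that were not in the input as written.
        "unexpected": sorted((returned - sent).elements()),
    }


# Shortened sample taken from the input and output in this post.
links = "/p/Clz-ssiumXM/,/p/Clq7RgqJxV_/,/,/".split(",")
output = {
    "products & categories": ["/p/Clz-ssiumXM/"],
    "designers & collections": ["/", "/p/Clq7RgqJxV_"],
    "others": ["/"],
}

report = audit_grouping(links, output)
print(report)
```

On this sample the check reports '/p/Clq7RgqJxV_/' as missing and '/p/Clq7RgqJxV_' as unexpected, because the returned link lost its trailing slash. Running the same check over the full output would show whether links are being dropped or altered in addition to being put in odd categories.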

Hey mate!

Welcome to the community forum. I have a hard time understanding where these links lead based only on the text. How would you expect GPT to know that?

Hi! Thanks for the reply!

Yeah, that’s a fair note. I do need to make sure links like these are present in the training set; otherwise I can understand how this is not feasible.

I was curious why it did not classify them ALL as “others” but only some of them, since they didn’t belong to any category but “others” (but maybe “others” also needs training data with those already present?).

This is an example of

  • the type of system prompt that you will use in your application and
  • the type of user input that you expect the user to input?

{“role”: “system”, “content”: "You are a link classifier: given a list of links and paths from fashion brands or fashion stores, you classify each link based on the most likely content they lead to. You categorize links into one of the following categories:

  • Products & Categories: This section contains product names, categories, and URL paths related to various products and their categories. Examples include individual product names like “bracelets” and “jackets”, as well as URLs to specific product pages and categories such as “/products/best-sellers”, “/collections/sale-skirts”, and “/collections/sneakers”.
  • Designers: This category focuses on paths and links that are specifically about the designers behind the products or collections. Examples are URLs like “/designers/chanel”, “/collections/designer-tom-ford”, and “/designer-spotlight”.
  • Policy Pages, FAQs, Shipping, Press & Help: This section lists URLs and strings related to policies, frequently asked questions, shipping, returns, press, and help pages of a website. It includes links to pages like “/policies/refund-policy”, “/pages/size-guide”, and “/pages/faq”.
  • Contact & About: Contains URLs related to contact information and about pages. Examples are “/contact”, “/pages/about-us”, and “/pages/our-story”.
  • Locations: Details URLs and strings about the physical locations or stockists of the business. Entries include “/pages/stockists” and “/pages/our-store”.
  • Others: A more miscellaneous category, it captures a wide range of URLs from the website, ranging from cart and checkout pages like “/cart” and “/checkout”, to other types of pages and actions such as “/blog/”, “/events”, and “/services/store-services”.
  • You return the classified content as a dictionary, where the keys are categories and the values are lists of links classified under them.
    "},

{“role”: “user”, “content”: “Classify these links: [‘sneakers’, ‘/collections/ceramics’, ‘/collections/kids-footwear’, ‘/collections/lingerie’, ‘new-arrivals-woman’, ‘/collections/womens-new-arrivals’, ‘/collections/jewelry-earrings’, ‘/c/designers-alice-olivia-cat40360738’, ‘/collections/pedro-garcia’, ‘/c/designers-david-yurman-cat540734’, ‘/collections/amber-sceats’, ‘/collections/shaina-mote’, ‘/c/designer-gift-shops-gucci-gift-shop-cat1696283890’, ‘/collections/maison-louis-marie’, ‘/shipping-FAQ’, ‘/pages/advertising’, ‘terms-conditions’, ‘/policies/cancellation-policy’, ‘/policies/exchange-policy’, ‘/pages/legal-notice’, ‘/team’, ‘/about/origin’, ‘real-estate’, ‘/pages/delivery’, ‘/rafi/list/where-to-shop-for-burning-man-in-nyc’, ‘/selectprints/’, ‘/delivery-choices?pr_id=promo_deliverystrip_108&pr_name=deliverystrip&pr_creative=sameday&pr_position=deliverystrip’, ‘/account’, ‘/services/virtual-services’]”},

{“role”: “assistant”, “content”: “{“products & categories”: [“sneakers”, “/collections/ceramics”, “/collections/kids-footwear”, “/collections/lingerie”, “new-arrivals-woman”, “/collections/womens-new-arrivals”, “/collections/jewelry-earrings”], “designers & collections”: [“/c/designers-alice-olivia-cat40360738”, “/collections/pedro-garcia”, “/c/designers-david-yurman-cat540734”, “/collections/amber-sceats”, “/collections/shaina-mote”, “/c/designer-gift-shops-gucci-gift-shop-cat1696283890”, “/collections/maison-louis-marie”], “policy pages, faqs, shipping, careers, press & help”: [“/shipping-FAQ”, “/pages/advertising”, “terms-conditions”, “/policies/cancellation-policy”, “/policies/exchange-policy”, “/pages/legal-notice”], “contact & about”: [“/team”, “/about/origin”], “locations”: [], “others”: [“real-estate”, “/pages/delivery”, “/rafi/list/where-to-shop-for-burning-man-in-nyc”, “/selectprints/”, “/delivery-choices?pr_id=promo_deliverystrip_108&pr_name=deliverystrip&pr_creative=sameday&pr_position=deliverystrip”, “/account”, “/services/virtual-services”]}”}]}

This just seems to be a massive misunderstanding about the purpose and use of fine-tune.

It might be. I am trying to ramp up on the topic, so I might be butchering the tech here. What would you say is wrong?

To answer your questions: conceptually yes. I have something like 10k links that I am trying to classify based on those categories, so I thought that prompt + input was a good combo to get that output (a structured, classified dictionary).

You are aware that the model doesn’t have internet access, right? :sweat_smile:

Fine-tuning is for teaching the AI what kind of output it should generate for a particular input, by giving it hundreds or thousands of examples of the output to be generated for a user input.

It can thus have a short system identifier because the type of output has been demonstrated by thousands of examples.

good fine-tune example conversation:

system: Classifo classifies links
user: https://help.openai.com/en/articles/5528730-fine-tuning-a-classifier-to-improve-truthfulness
assistant: documentation

bad fine-tune example conversation (see above)
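
A rough sketch of how grouped training lines like the one in the first post could be reshaped into one-link, one-label examples. It assumes the same chat-style `messages` JSONL layout shown there; `explode_example` is a hypothetical helper name, and the short system message is just the one from the good example.

```python
import json

# Short system message, reused from the "good" example conversation.
SYSTEM = "Classifo classifies links"


def explode_example(grouped):
    """Turn one grouped answer into one training example per link."""
    examples = []
    for category, links in grouped.items():
        for link in links:
            examples.append({
                "messages": [
                    {"role": "system", "content": SYSTEM},
                    {"role": "user", "content": link},
                    {"role": "assistant", "content": category},
                ]
            })
    return examples


# A few entries copied from the assistant output in the sample training line.
grouped = {
    "products & categories": ["sneakers", "/collections/ceramics"],
    "contact & about": ["/team", "/about/origin"],
    "locations": [],
}

# One JSON object per line, ready to be written to a .jsonl file.
lines = [json.dumps(example) for example in explode_example(grouped)]
print("\n".join(lines))
```

Empty categories simply produce no examples, so a 500-line grouped file would likely expand into several thousand short examples.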

Yeah of course, it should all be based on string recognition. Those link types might not be present in training data and maybe that’s the issue?

Can you say a little more about why the above is wrong? I don’t see how my training set/job is fundamentally different from what you are describing.

What you show above is a huge system prompt, which is supposed to be the same kind of system prompt that you would use in your application. You’d have to supply the same huge system prompt to use the fine-tune model, completely the opposite of what fine-tune is for. Your fine-tune is only triggered by the same type of system prompt.

What you show above is a huge user input with a whole bunch of links in a very long list. Are you then going to input a huge list of links every time you use the fine-tune AI?

Then for the output, you have a massive re-grouping of all those links that is beyond the AI’s ability to understand or be trained on. Even generating that goes against the way an open-ended transformer works: it would have to produce the text for all of one category until it is done filling that category, and then start producing text for the next category until it thinks it got them all. It then can’t reflect and go back to ones it didn’t produce earlier. Grouping things by classification is a task that AI is terrible at, and you aren’t teaching it anything except perhaps the names of the categories, which we see you already put into a massive system prompt.

Yeah, that was the goal basically

Interesting. Are you saying that the transformer doesn’t read all the text first and then output results, but works almost first in, first out? Why would it have issues classifying things?

Noted that those links are just weird, so there is no way to know what they are. I would need to add them to the training data in some way that is standardized enough for it to recognize the pattern.

It is an open-ended word-by-word generation, with no going back to revise what it wrote earlier.

A task that will have bad results:

"group these 100 songs into two outputs, “happy songs” and “sad songs”."

By the time it gets to the end of producing the sad songs and realizes that some songs it hasn’t included yet are actually happy, the “happy songs” list is already closed.

I mean, I am clearly doing something wrong. The output here is definitely not what I was hoping for.

Why would all those collections links be classified as “others”, and the privacy link as a designer? The non-tuned model actually does better. How is that possible?

{‘products & categories’: [‘/destinations/washington/seattle/shopping/’],
‘designers & collections’: [‘/california-privacy-rights/’,
‘/interests/food-drink/’,
‘/destinations/washington/seattle/neighborhoods/’,
‘/newsletters/’,
‘/8f88e25d-1e4f-43f3-b2ad-74e039da705b’,
‘/destinations/washington/seattle/hotels/’,
‘/destinations/washington/seattle/attractions/’,
‘/interests/trip-planning/’,
‘/local-experts/’],
‘policy pages, faqs, shipping, careers, press & help’: [‘/privacy-statement/’],
‘contact & about’: [],
‘locations’: [],
‘others’: [‘/interests/arts-culture/’,
‘/destinations/washington/seattle/restaurants/’,
‘12066322877’,
‘/destinations/washington/seattle/’,
‘/collections/mens-laced-shoes’,
‘/collections/womens-small-leather-goods’,
‘/collections/mens-flat-sandals’,
‘/collections/10-corso-como’,
‘/collections/mens-handbags’,
‘/collections/mens-wearable-accessories’,
‘/collections/mens-belt-bags’]}

Because of what I just said. The AI produces a token at a time. There is no “thinking” going on, there is just a certain point where it becomes more likely to produce the closing brackets of a category than to produce the text that continues in a category.

Multi-headed attention, especially in gpt-3.5 that can be tuned, just doesn’t have enough layers to contemplate the entire input at once. It doesn’t have a score-keeping or inner monologue where it can go down the list and first mark them by what category they should go in.

Just turn this into a single input - single output job. It will consume about the same number of tokens, given that the whole idea is not needing a huge system prompt any more. Even then, it will likely not work well on links that differ from the examples you provided, because there is no making sense of ‘/8f88e25d-1e4f-43f3-b2ad-74e039da705b’.
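
A minimal sketch of what a single input - single output job could look like: the model only ever labels one link, and the grouping into a dictionary happens in ordinary code. `classify_links` and `fake_classifier` are hypothetical names; `fake_classifier` is a rule-based stand-in for the call to the fine-tuned model, which is not shown here.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def classify_links(
    links: Iterable[str],
    classify_one: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Label each link on its own, then group the results in code."""
    grouped = defaultdict(list)
    for link in links:
        # One request per link: the model never has to track a long list.
        grouped[classify_one(link)].append(link)
    return dict(grouped)


def fake_classifier(link: str) -> str:
    """Stand-in for the model call (assumption: swap in the real request here)."""
    if link.startswith("/policies/"):
        return "policy pages, faqs, shipping, careers, press & help"
    if link.startswith("/collections/"):
        return "products & categories"
    return "others"


result = classify_links(
    ["/collections/sneakers", "/policies/refund-policy", "/cart",
     "/collections/mens-handbags"],
    fake_classifier,
)
print(result)
```

Because the grouping is done by the loop, no link can be dropped, duplicated, or rewritten on the way into the dictionary; only the individual labels depend on the model.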

If you already have the links and they aren’t changing, and you already classified them, this isn’t a job for AI at all.

A few questions:

  1. What do you mean by a single input/output job? Classifying one link at a time?
  2. Is the model able to generalize that one /collections/ link (e.g. ‘/collections/bergfabel’) is a designer link and another (e.g. ‘/collections/sneakers’) is a product category page? (Granted that at training time I will provide some examples of /collections/name and /collections/thing, but not all the possible examples in the population.)
  3. I have a lot of links and only manually classified some. I was hoping to use GPT to do the rest. Maybe GPT-4 can do it?

Here is the task written in a way that AI can perform, at 1/8th the output cost. It will succeed because it is operating on one item at a time from a very deterministic list.

https://chat.openai.com/share/912bbe6b-ab44-4651-8e16-b5d3dc1078a0

It doesn’t even take a system instruction beyond “you are chatgpt”.
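
With one link per request, the reply should be a single category name, which makes it easy to check. A small sketch, assuming the category names from the outputs earlier in the thread; `normalize_label` is a hypothetical helper, and falling back to "others" for anything unrecognized is a design choice, not something the linked chat prescribes.

```python
# Category names as they appear in the model outputs earlier in the thread.
CATEGORIES = [
    "products & categories",
    "designers & collections",
    "policy pages, faqs, shipping, careers, press & help",
    "contact & about",
    "locations",
    "others",
]


def normalize_label(reply: str, fallback: str = "others") -> str:
    """Map a raw single-label reply onto the fixed category list."""
    # Drop surrounding whitespace, then stray quotes or a trailing period.
    cleaned = reply.strip().strip("\"'.").lower()
    return cleaned if cleaned in CATEGORIES else fallback


print(normalize_label(' "Locations".\n'))
print(normalize_label("Designers"))
```

Counting how often the fallback is hit on a held-out set of hand-labeled links would give a quick signal of whether the model is staying inside the category list.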