Different output from API and Playground (fine-tuning)

I’ve made a fine-tuned model which works perfectly in the Playground, but when I use my own script it produces a much worse result, and I have no idea why.

Playground example:
User:
Arne Jacobsen Seven chair in black leather, Fritz Hansen, black leather chrome frame produced in 2010 good condition few marks and scratches on the leather and a light patina chrome frame black leather,“Height 30 / Width 20 / Depth 40 / Seat height 44,5 Cm”

Assistant:
Arne Jacobsen Seven chair upholstered in black leather divided with a chrome frame. The chair is from 2010 and is generally in good condition, but with a few marks and small scratches on the leather, and a light patina on the chrome frame.

*Designer: Arne Jacobsen
*Manufacturer: Fritz Hansen
*Model: Seven Chair
*Dimensions: Height: 30 / Width: 20 / Depth: 40 / Seat height: 44,5 Cm
*Materials: Black leather and chrome frame
*Condition: Pre-owned chair from 2010

This is the perfect output, but when I use my own script, as shown below:

import csv 
import openai
import pandas as pd 
from dotenv import load_dotenv
import os

load_dotenv()
API_KEY = os.getenv("API")

openai.api_key = API_KEY
model_name = "ft:gpt-3.5-turbo-0613:personal::123123"

input_csv_path = "C:/Users/myname/OneDrive/code/auto-v/csv.csv"
#read the csv file into a dataframe
df = pd.read_csv(input_csv_path, header=None)

for index, row in df.iterrows():
  name = row[0]
  producer = row[4]
  description = row[5]
  measurements = row[6]

  print(name, producer, description, measurements)
  #generate text
  response = openai.Completion.create(
    engine = "text-davinci-002",
    prompt=f"write a product description on the following on this product: {name} {producer} {description} {measurements}",
    max_tokens=1000
  )
  print(response)
  #extract the generated text
  generated_text = response.choices[0].text.strip()

  #append the generated text to the last column
  row[9] = generated_text

#save the dataframe back to the csv file
df.to_csv(input_csv_path, header=False, index=False)

it returns this:
An ultra-modern classic, the Arne Jacobsen Seven chair is a beautifully designed piece that will add a touch of elegance to any home. The sleek black leather and chrome frame are perfect for a contemporary space, and the chair is in good condition with only a few marks and scratches on the leather and chrome. The chair is comfortable and stylish, and would make a great addition to any home.

Can anyone explain why the outputs are so massively different?
The input is exactly the same; I do extract it from a CSV file, but that shouldn’t make a difference.

Hi and welcome to the Developer Forum!

Are the replies identical each time? As in, have you run this test with a temperature of, say, 0.1, ten times and compared the results?

(I notice you have no temperature in your API call.)
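
For example, a fixed low temperature can be added to the existing call (a minimal sketch of the same call from your script, using the same variables read from the CSV; 0.1 is just an illustrative value):

response = openai.Completion.create(
  engine="text-davinci-002",
  prompt=f"write a product description on the following on this product: {name} {producer} {description} {measurements}",
  temperature=0.1,  #a low, fixed temperature reduces run-to-run randomness
  max_tokens=1000
)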

Currently, the replies are different each time I run the script.

I just tried adding a temperature and running a few tests, but I still get completely different returns, nowhere near the ones I get in the Playground, which are the desired output.

OK, so have you done the same ten times in the Playground?

Would you say that, without any room for interpretation, the ones from the Playground are far superior to the ones from the API, or could there be a subjective element of confirmation bias slipping in?

I’ve run around 20 tests in the Playground now, 10 with a temperature of 0.1 and 10 with a temperature of 1, and all the Playground results were perfect.

I’ve run the same number of tests with my script, and 0/20 returned a satisfactory response.

It seems like the API does not even take the training data into account. The format, etc. is clearly laid out in the training data I provided, and the Playground has no issue recognizing the “pattern”, but the API certainly does.

If there is no error in my code that could result in these “bad” responses, I would say that the Playground is far superior to the API.

If you have any other suggestions on how I could test and improve the results from my script, I would be more than grateful. Thank you.

Welcome to the community @Aps2212

Why are you using text-davinci-002 when you already have a model fine-tuned on your data?

Also, the model you fine-tuned is served on the Chat Completions API, and you’re using the Completions API in your script.

1 Like

I’m quite new to this, so I’m sorry if I’ve made an obvious mistake. What would you suggest I do instead?

Thank you for your reply

My bad for not spotting that at the start :smile: Ahh, don’t feel bad, I also missed it!

(Meant in reply to @Aps2212.)

1 Like

First read the docs.

Then use this model with the boilerplate chat completions code from the docs shared in my last reply, together with your prompt.

2 Likes

No worries @Foxalabs

It happens to the best of us.

Thank you, it seems I’ve fixed the issue based on your advice. I changed the code so it uses chat completions and my fine-tuned model instead, and now it works fine.
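
For anyone else hitting the same thing, the change looked roughly like this (a minimal sketch using the same legacy openai Python SDK as the original script; the model ID is the placeholder from earlier in the thread and the variables are the ones read from the CSV):

response = openai.ChatCompletion.create(
  model="ft:gpt-3.5-turbo-0613:personal::123123",  #the fine-tuned chat model instead of text-davinci-002
  messages=[
    {"role": "user", "content": f"write a product description on the following on this product: {name} {producer} {description} {measurements}"}
  ],
  temperature=0.1,
  max_tokens=1000
)
#chat completions return a message object rather than plain text
generated_text = response.choices[0].message.content.strip()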

Thank you both for your help @Foxalabs @sps

3 Likes