Force GPT 3.5 Turbo to choose an answer from a set of predefined options

Hello everyone!

I am trying to use the API to help me categorise posts. I am migrating a website with over 100k posts, and I want to categorise everything better as part of the migration. My current website has around 5 categories, and I want to expand them to 25.

I use the system message to set the task and list the categories to choose from, and I pass the topic in the user message. It works relatively well, but sometimes the model makes up categories that combine several of the ones I have given, and sometimes it even creates new ones that I never listed.
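One way to catch the out-of-list answers described above is to validate every reply in code and snap near-misses (extra quotes, a trailing period, different casing, a small typo) back onto an allowed label. This is only a sketch: `snap_to_category` is a hypothetical helper, the `cutoff` of 0.8 is an assumed value that would need tuning, and fuzzy matching can in principle map a wrong answer onto a similar-looking label.

```python
import difflib


def snap_to_category(answer, allowed, cutoff=0.8):
    """Map a raw model answer onto one of the allowed labels, or return None."""
    # Remove surrounding whitespace, quotes, and trailing periods.
    cleaned = answer.strip().strip("\"'.").strip()
    # Case-insensitive lookup back to the canonical spelling.
    lookup = {label.lower(): label for label in allowed}
    if cleaned.lower() in lookup:
        return lookup[cleaned.lower()]
    # Fall back to the single closest label, if it is similar enough.
    close = difflib.get_close_matches(cleaned.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None


# Hypothetical subset of labels, for illustration only.
allowed = ["Europe > Politics", "Europe > Economy", "Health"]
print(snap_to_category(' "europe > politics". ', allowed))
```

A `None` result means the reply could not be matched, so the row can be retried or flagged for manual review instead of storing an invented category.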

I tried a few ways to structure the task. First, I described a main category named “Europe” with subcategories named “Politics”, “Economics” and so on. That didn’t work well. Passing them as flat labels instead, such as “Europe > Politics” and “Europe > Economics”, without stating that they are subcategories, worked better.
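The flat “Parent > Child” labels described above can be generated from a nested structure, so the hierarchy only has to be maintained in one place. A minimal sketch; `flatten_categories` and the `hierarchy` dict are hypothetical names used for illustration:

```python
def flatten_categories(hierarchy):
    """Turn {"Europe": ["Politics", ...]} into ["Europe > Politics", ...].

    A parent mapped to an empty list is kept as a standalone label.
    """
    labels = []
    for parent, children in hierarchy.items():
        if not children:
            labels.append(parent)
        for child in children:
            labels.append(f"{parent} > {child}")
    return labels


# Hypothetical excerpt of a category tree, for illustration only.
hierarchy = {
    "Elections": [],
    "Europe": ["Politics", "Economy"],
}
flat_labels = flatten_categories(hierarchy)
print(flat_labels)
```

The resulting list can be joined into the system message and reused for validating the model's answers.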

Can someone help me build a better prompt?

One more thing: GPT-4 works perfectly (though much more slowly), but it is too expensive for my budget. I know that the results can’t be 100% accurate.

Any guidance would be highly appreciated!

Thank you!

At first glance, this should be achievable with gpt-3.5-turbo.
Would you mind sharing your prompt?

Hi!

Have you looked at embeddings for classification?
They’re fast and cheap, and there is an example in the docs that you can reference.

https://platform.openai.com/docs/guides/embeddings/use-cases
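To make the suggestion concrete, here is a rough sketch of label matching with embeddings: embed the headline and every category label, then pick the label with the highest cosine similarity. The helper names (`cosine`, `classify`, `openai_embed`) are hypothetical, and the model name `text-embedding-3-small` is an assumption; any embedding model could be swapped in.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(text, labels, embed):
    """Return the label whose embedding is closest to the text's embedding.

    `embed` is any callable mapping a list of strings to a list of vectors.
    """
    vectors = embed([text] + list(labels))
    text_vec, label_vecs = vectors[0], vectors[1:]
    scores = [cosine(text_vec, vec) for vec in label_vecs]
    return labels[scores.index(max(scores))]


def openai_embed(texts, model="text-embedding-3-small"):
    """Assumed backend; needs the `openai` package and an API key."""
    from openai import OpenAI  # imported lazily so the rest runs without it

    client = OpenAI()
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]
```

Usage would look like `classify(headline, categories, openai_embed)`. Because the result is always picked from `labels`, this approach cannot invent a category. In a real run the label vectors would be embedded once and cached rather than re-embedded for every headline.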

I am not sure I understand correctly, but I don’t think this type of classification would help me. The example maps, let’s say, 1-3 star ratings to negative and 4-5 star ratings to positive. That task seems much simpler because there is a fixed answer. In my case, the algorithm has to choose the category on its own.

Hello,

I will paste the code here. The messages are in Bulgarian, but I have translated them here so you can understand them.

# -*- coding: utf-8 -*-
import openai
import pandas as pd

openai.api_key = '***********'

categories = [
    "Elections", "Bulgaria > Politics", "Bulgaria > Economy", "Bulgaria > Society",
    "Bulgaria > Justice", "Bulgaria > Crime", "Bulgaria > Incidents", "Bulgaria > Culture",
    "Europe > Politics", "Europe > Economy", "Europe > Society", "Europe > Justice",
    "Europe > Crime", "Europe > Incidents", "Europe > Culture", "World > USA and Canada",
    "World > Russia", "World > Middle East", "World > Asia", "World > Latin America",
    "World > China", "War", "Sport > Football", "Sport > Basketball", "Sport > Volleyball",
    "Sport > Rugby", "Sport > Motor Sports", "Sport > Baseball", "Sport > Others",
    "Lifestyle > Fashion", "Lifestyle > Gossip", "Lifestyle > Curious", "Lifestyle > Recipes",
    "Science and Technology > IT", "Science and Technology > Space", "Science and Technology > Artificial Intelligence",
    "Science and Technology > Others", "Health"
]


def get_openai_response(content, retry=False):
    try:
        messages = [
            {"role": "system",
             # Adjacent string literals are concatenated, so the prompt can span two lines.
             "content": "You are an assistant who needs to categorize news headlines according to the following categories: Elections, Bulgaria > Politics, Bulgaria > Economy, Bulgaria > Society, Bulgaria > Justice, Bulgaria > Crime, Bulgaria > Incidents, Bulgaria > Culture, Europe > Politics, Europe > Economy, Europe > Society, Europe > Justice, Europe > Crime, Europe > Incidents, Europe > Culture, World > USA and Canada, World > Russia, World > Middle East, World > Asia, World > Latin America, World > China, War, Sport > Football, Sport > Basketball, Sport > Volleyball, Sport > Rugby, Sport > Motor Sports, Sport > Baseball, Sport > Others, Lifestyle > Fashion, Lifestyle > Gossip, Lifestyle > Curious, Lifestyle > Recipes, Science and Technology > IT, Science and Technology > Space, Science and Technology > Artificial Intelligence, Science and Technology > Others, Health. You are not allowed to change the categories I have given you. Answer only with the most appropriate category for the headline.\n"
                        "Here are some guidelines for the categories - Elections – news about elections in Bulgaria, Bulgaria > Politics – political news about politics and politicians in Bulgaria, Bulgaria > Economy – economic news about the economy of Bulgaria, Bulgaria > Society – social news about society in Bulgaria, Bulgaria > Justice – justice news about justice in Bulgaria, Bulgaria > Crime – criminal news about Bulgaria, Bulgaria > Incidents – incidents and accidents in Bulgaria, Bulgaria > Culture – cultural news about Bulgaria, Europe > Politics – news about European politics and politicians in Europe, Europe > Economy - news about the economy of Europe, Europe > Society – social news about society in Europe, Europe > Justice – justice news about justice in Europe, Europe > Crime – criminal news about Europe, Europe > Incidents – incidents and accidents in Europe, Europe > Culture – cultural news about Europe, World > USA and Canada – news about America and Canada, statements by American and Canadian politicians, World > Russia – news about Russia, statements by Russians, World > Middle East – news about countries in the Middle East, World > Asia – news about Asia, World > Latin America – news about Latin America, World > China – news about China, War – war news about the whole world and all military conflicts and actions"},
            {"role": "user", "content": content}
        ]

        if retry:
            messages[0][
                "content"] += " Be sure to follow the categories I have given you, I do not want to receive an answer different from them."

        response = openai.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=messages,
            max_tokens=20
        )

        # Strip surrounding whitespace so the exact-match check below is not tripped by it.
        category = response.choices[0].message.content.strip()

        if category not in categories and not retry:
            print(f"Error! The category is {category}, retrying!")
            return get_openai_response(content, retry=True)

        return category
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


def process_excel(file_path):
    df = pd.read_excel(file_path)

    for index, row in df.iterrows():
        content = row['title']
        if pd.notna(content):  # skip rows without a title
            response = get_openai_response(content)
            print(f"Response for row {index}: {response}")
            df.at[index, 'categorygpt'] = response

    df.to_excel(file_path, index=False)
    print(f"Updated Excel file saved to {file_path}") 


if __name__ == "__main__":
    file_path = '***************************'
    process_excel(file_path)

It started working relatively well like this, though it still makes mistakes. I ran it on a 50k-row file and it started at around 40 rows per minute. I left it running overnight, and it is now down to about 1 row per minute. Can you advise me on that too?
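The cause of the slowdown is not clear from the post alone (rate limiting or slow responses on the API side are possibilities, not confirmed causes). Two generic mitigations are to send requests concurrently and to report results in chunks, so that progress can be saved and an interrupted run resumed. A minimal sketch, where `classify_in_chunks`, `on_chunk_done`, `chunk_size` and `workers` are all hypothetical names and values:

```python
from concurrent.futures import ThreadPoolExecutor


def classify_in_chunks(titles, classify, on_chunk_done, chunk_size=500, workers=8):
    """Classify titles concurrently, reporting results one chunk at a time.

    `classify` maps one title to one category; `on_chunk_done(start, results)`
    is called after each chunk, e.g. to write a checkpoint file.
    """
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(titles), chunk_size):
            chunk = titles[start:start + chunk_size]
            chunk_results = list(pool.map(classify, chunk))  # input order is preserved
            results.extend(chunk_results)
            on_chunk_done(start, chunk_results)
    return results
```

With the script above, `classify` could be `get_openai_response` and `on_chunk_done` could save the partial results to disk. A higher `workers` value may simply run into rate limits sooner, so it is worth starting low.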

I reiterate @vb’s suggestion that you should really look into embeddings for this, especially if you are interested in a quick and cheap approach.

Have a look at this thread, which also provides some high level guidance on how to achieve this for a categorization task:

The other option, which will likely yield very good results but is definitely more expensive than embeddings, would be a fine-tuned gpt-3.5 model. I have used this approach quite a bit in the past, and it has generally yielded very good results. That said, embeddings really are the best all-around solution.
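For the fine-tuning route, the training data is a JSONL file with one chat example per line. Below is a small sketch of how already-labeled headlines could be converted into that shape. `to_finetune_jsonl`, the example pairs, and the system prompt are made up for illustration and are not taken from the original script.

```python
import json


def to_finetune_jsonl(examples, system_prompt):
    """Serialize (headline, category) pairs as chat-format JSONL lines."""
    lines = []
    for headline, category in examples:
        record = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": headline},
                {"role": "assistant", "content": category},
            ]
        }
        # ensure_ascii=False keeps non-Latin text (e.g. Bulgarian) readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Hypothetical labeled examples, for illustration only.
examples = [
    ("Parliament votes on the new budget", "Bulgaria > Politics"),
    ("Local team wins the cup final", "Sport > Football"),
]
jsonl = to_finetune_jsonl(examples, "Categorize the news headline.")
print(jsonl)
```

The assistant message holds exactly the label the model should return, which is what pushes a fine-tuned model toward answering only with listed categories.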