JSON schema for a two-level category

I am asking GPT to categorize my info into two-level categories. The old descriptive (prompt-only) approach works fine: it sorts each item into {primary, secondary} categories.
I want to use the new response_format capability, but I found it very cumbersome to define a JSON schema for a two-level category. I even asked GPT, and the answer was still lengthy and verbose.
I have only just learned about JSON Schema. Do you guys know any easy way to define a JSON schema for a two-level category?

Primary A, Secondary A1, A2, A3, A4
Primary B, Secondary B1, B2, B3, B4
Primary C, Secondary C1, C2, C3, C4
Primary D, Secondary D1, D2, D3, D4

Hi!

So I tried to emulate a two-level categorization with Structured Outputs - I hope this is helpful to you.

For this example, I emulated Amazon-like two-level product categorization based on a product description. So for some description, you could get e.g. Primary Category == “Electronics”, and Secondary Category == “Mobile Phones”. You can be even more strict with these categorizations by specifying them in the enum field.

This is how I constructed the JSON schema:

json_schema = {
    "name": "Categorization",
    "schema": {
        "type": "object",
        "properties": {
            "results": {
                "type": "array",
                "description": "List of two-level product categorization results",
                "items": {
                    "type": "object",
                    "properties": {
                        "product_id": {
                            "type": "string"
                        },
                        "primary_category": {
                            "type": "string"
                        },
                        "secondary_category": {
                            "type": "string"
                        }
                    },
                    "required": ["product_id", "primary_category", "secondary_category"],
                    "additionalProperties": False
                }
            }
        },
        "required": ["results"],
        "additionalProperties": False
    },
    "strict": True
}

This is then the sample data I used:

product_descriptions = """
Product ID: “P123456789”, Product Description: “Apple iPhone 14 Pro Max, 256GB, Space Gray”
Product ID: “P987654321”, Product Description: “Samsung 55-Inch QLED Smart TV - QN55Q60AAFXZA”
Product ID: “P112233445”, Product Description: “Instant Pot Duo 7-in-1 Electric Pressure Cooker, 6 Quart”
Product ID: “P998877665”, Product Description: “Nike Air Max 270 Men’s Running Shoes, Black/White”
Product ID: “P334455667”, Product Description: “LEGO Star Wars: The Mandalorian The Razor Crest 75292”
Product ID: “P776655443”, Product Description: “Sony WH-1000XM4 Wireless Noise-Canceling Headphones, Black”
Product ID: “P554433221”, Product Description: “Dyson V11 Animal Cordless Vacuum Cleaner”, Primary Category: “Home & Kitchen”
Product ID: “P223344556”, Product Description: “Patagonia Men’s Better Sweater Fleece Jacket, Navy”, Primary Category: “Mens Clothing”
Product ID: “P665544332”, Product Description: “Canon EOS R6 Mirrorless Camera with RF 24-105mm Lens”, Primary Category: “Electronics”
Product ID: “P443322110”, Product Description: “Microsoft Surface Pro 7 - 12.3-inch Touch-Screen - Intel Core i5 - 8GB Memory - 256GB SSD - Platinum”
"""

And finally, this is how I made the call to the API:

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """You are a product categorization assistant.
                Your job is to perform a two-level product categorization (primary and secondary categories) based on the provided product descriptions.
                For example, a product may be "Electronics" in primary category, and "Mobile Phones" in the secondary category.
                You are to return the categorizations, along with the associated product ID, as per the enclosed scheme.""",
        },
        {
            "role": "user",
            "content": product_descriptions
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": json_schema,
    }
)

This results in the following:

pprint.pprint(json.loads(response.choices[0].message.content), indent=4)
{   'results': [   {   'primary_category': 'Electronics',
                       'product_id': 'P123456789',
                       'secondary_category': 'Mobile Phones'},
                   {   'primary_category': 'Electronics',
                       'product_id': 'P987654321',
                       'secondary_category': 'Televisions'},
                   {   'primary_category': 'Home & Kitchen',
                       'product_id': 'P112233445',
                       'secondary_category': 'Kitchen Appliances'},
                   {   'primary_category': 'Footwear',
                       'product_id': 'P998877665',
                       'secondary_category': "Men's Shoes"},
                   {   'primary_category': 'Toys & Games',
                       'product_id': 'P334455667',
                       'secondary_category': 'Building Sets'},
                   {   'primary_category': 'Electronics',
                       'product_id': 'P776655443',
                       'secondary_category': 'Audio Equipment'},
                   {   'primary_category': 'Home & Kitchen',
                       'product_id': 'P554433221',
                       'secondary_category': 'Vacuum Cleaners'},
                   {   'primary_category': 'Mens Clothing',
                       'product_id': 'P223344556',
                       'secondary_category': 'Jackets & Coats'},
                   {   'primary_category': 'Electronics',
                       'product_id': 'P665544332',
                       'secondary_category': 'Cameras'},
                   {   'primary_category': 'Computers & Tablets',
                       'product_id': 'P443322110',
                       'secondary_category': 'Tablets'}]}

You can try to play around with my json_schema above and tweak it to your need. I hope this helps you!


This is not what I mean. What I mean is:
Primary Category has to be one of four values: A, B, C, D
Secondary Category has to be based on Primary Category: A0, A1, A2, A3 for A, or B0, B1, B2, B3 for B, and so on

Your schema does not specify the enum of the primary category, nor of the secondary category. The catch is: the secondary category’s enum has to depend on the primary category.

So in that case you just modify the above schema to include those categories as enum, and you specify this dependency in your prompt. I tried it on my example above and it works.

For your example, I lack the full context, but here is a generic solution (I used only “A” and “B” as primary categories for readability, but you can extend this to an arbitrary number of categories).

JSON schema:

json_schema = {
    "name": "Categorization",
    "schema": {
        "type": "object",
        "properties": {
            "results": {
                "type": "array",
                "description": "List of two-level categorization results",
                "items": {
                    "type": "object",
                    "properties": {
                        "id": {
                            "type": "string"
                        },
                        "primary_category": {
                            "type": "string",
                            "enum": ["A", "B", "Other"]
                        },
                        "secondary_category": {
                            "type": "string",
                            "enum": ["A0", "A1", "A2", "A3", "B0", "B1", "B2", "B3", "Unknown"]
                        }
                    },
                    "required": ["id", "primary_category", "secondary_category"],
                    "additionalProperties": False
                }
            }
        },
        "required": ["results"],
        "additionalProperties": False
    },
    "strict": True
}
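
Since writing this by hand gets verbose as the category list grows, here is a minimal sketch of generating the same shape of schema from a plain dict. The names HIERARCHY and build_flat_schema are hypothetical, not part of any library, and the result still has one independent enum per level, so it does not encode the primary/secondary dependency.

```python
# Hypothetical mapping of primary category -> allowed secondary categories.
HIERARCHY = {
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
    "Other": ["Unknown"],
}


def build_flat_schema(hierarchy):
    """Return a json_schema dict with one enum per level (no cross-field dependency)."""
    primaries = list(hierarchy)
    # Flatten every secondary list into a single enum, keeping insertion order.
    secondaries = [s for subs in hierarchy.values() for s in subs]
    item = {
        "type": "object",
        "properties": {
            "id": {"type": "string"},
            "primary_category": {"type": "string", "enum": primaries},
            "secondary_category": {"type": "string", "enum": secondaries},
        },
        "required": ["id", "primary_category", "secondary_category"],
        "additionalProperties": False,
    }
    return {
        "name": "Categorization",
        "schema": {
            "type": "object",
            "properties": {
                "results": {
                    "type": "array",
                    "description": "List of two-level categorization results",
                    "items": item,
                }
            },
            "required": ["results"],
            "additionalProperties": False,
        },
        "strict": True,
    }


generated_schema = build_flat_schema(HIERARCHY)
```

Extending to more categories then only means adding entries to the dict.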

The corresponding API call with the prompt is as follows (NOTE: I recommend always implementing an “exit” strategy with the categorizations, so adding some kind of “Unknown” or “Misc” categorization is helpful).

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """You are a categorization assistant.
                Your job is to perform a two-level categorization (primary and secondary categories) based on the provided text.
                You are to return the categorizations, along with the associated ID, as per the enclosed scheme.
                Note that the secondary categorization is dependent on the primary categorization, as follows:

                * A
                    - A0
                    - A1
                    - A2
                    - A3
                
                * B
                    - B0
                    - B1
                    - B2
                    - B3
                
                * Other
                    - Unknown
                
                Note that when you are uncertain with the categorization, put it under Other -> Unknown

                """,
        },
        {
            "role": "user",
            "content": <CONTENT>
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": json_schema,
    }
)

This should work!


Your example works, but your schema is not tight: it does not fully capture the rules.
The LLM understands your prompt; I got this working before without a schema. Now that we have response_format, I am just wondering how to use the JSON schema to tighten it up, not how to make it work. It worked a long time ago with a prompt and no schema, but it occasionally made mistakes. Since we now have Structured Outputs, I am thinking about how to use the schema to make sure the LLM won’t make any mistakes.

from enum import Enum
from openai import OpenAI
from pydantic import BaseModel, model_validator, ValidationError

product_descriptions = ['Blue Box', 
                        'Red Marble',
                        'Blue Marble']


class Color(Enum):
    "The detected color"
    RED = "Red"
    BLUE = "Blue"

class Shape(Enum):
    "The detected shape"
    SQUARE = "Square"
    TRIANGLE="Triangle"
    CIRCLE="Circle"

valid_configs = {Color.RED: [Shape.SQUARE, Shape.TRIANGLE, Shape.CIRCLE],
                 Color.BLUE: [Shape.SQUARE]
                }

class CategoryDetection(BaseModel):
    color: Color
    shape: Shape

    @model_validator(mode='after')
    def validate_configs(self):
        # An 'after' validator runs on the already-constructed model instance,
        # so the parsed fields are available as attributes on self.
        # Accept the pair only if the shape is listed for the detected color.
        if self.shape in valid_configs.get(self.color, []):
            return self
        raise ValueError(f"Not a valid comination of color {self.color} and shape{self.shape} ")
        


def test_detect_category():
    client = OpenAI()


    for product_description in product_descriptions:

        try:        
            completion = client.beta.chat.completions.parse(
                model="gpt-4o-mini",
                messages=[
                    {
                        "role": "user",
                        "content": product_description
                    }
                ],
                response_format=CategoryDetection,
            )

            detected = completion.choices[0].message.parsed
            print(f" {product_description}   -> {detected}")
        except ValidationError as ve:
            print(str(ve))


test_detect_category()
Blue Box   -> color=<Color.BLUE: 'Blue'> shape=<Shape.SQUARE: 'Square'>
 Red Marble   -> color=<Color.RED: 'Red'> shape=<Shape.CIRCLE: 'Circle'>
1 validation error for CategoryDetection
  Value error, Not a valid comination of color Color.BLUE and shapeShape.CIRCLE  [type=value_error, input_value={'color': 'Blue', 'shape': 'Circle'}, input_type=dict]

@newoakllc2023 I agree, it’s not 100% tight. It’s still much better compared to pre json_schema. Before, you could “make it work” by specifying the schema in the prompt and enforcing valid JSON, but at the decoding step you would still technically be at the mercy of any one of the roughly 100,000 tokens in the GPT-4 vocabulary. Now you are constraining the tokens at the decoding step according to json_schema and the enum, so each field can only take one of the values (i.e. “categories”) you specified. But as you say, it’s not super tight: the schema above does not tie the secondary enum to the primary one, so technically there is some probability of getting e.g. “A” → “B0”. You can catch that with validation, as @icdev2dev pointed out above.
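
One way the schema itself might encode the dependency is an anyOf with one branch per primary category, each branch pinning primary_category to a single value and restricting secondary_category to that category's children. The sketch below assumes that Structured Outputs accepts anyOf below the root object, and it has not been run against the API here. HIERARCHY, build_dependent_items and allowed_by are hypothetical names; allowed_by is only a local stand-in for what the branches permit.

```python
# Assumed mapping, mirroring the A/B/Other example in this thread.
HIERARCHY = {
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
    "Other": ["Unknown"],
}


def build_dependent_items(hierarchy):
    """Build an `items` schema with one anyOf branch per primary category."""
    branches = []
    for primary, secondaries in hierarchy.items():
        branches.append({
            "type": "object",
            "properties": {
                "id": {"type": "string"},
                # A single-value enum pins this branch to one primary category.
                "primary_category": {"type": "string", "enum": [primary]},
                # Only this primary's children are allowed in the branch.
                "secondary_category": {"type": "string", "enum": list(secondaries)},
            },
            "required": ["id", "primary_category", "secondary_category"],
            "additionalProperties": False,
        })
    return {"anyOf": branches}


def allowed_by(items_schema, item):
    """Local check: does `item` fit the enums of at least one branch?"""
    for branch in items_schema["anyOf"]:
        props = branch["properties"]
        if (item["primary_category"] in props["primary_category"]["enum"]
                and item["secondary_category"] in props["secondary_category"]["enum"]):
            return True
    return False


dependent_items = build_dependent_items(HIERARCHY)
print(allowed_by(dependent_items, {"id": "1", "primary_category": "A", "secondary_category": "A0"}))  # True
print(allowed_by(dependent_items, {"id": "2", "primary_category": "A", "secondary_category": "B0"}))  # False
```

To try it, dependent_items would replace the "items" value under "results" in json_schema. The schema grows with the number of primary categories, so it is worth keeping the post-response validation as a safety net either way.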

Another alternative is to have a pseudo-hierarchy. In fact I have done that successfully for years in “classical ML”, where I didn’t want to deal with multi-class and hierarchical classification (all the logits exploding on me, hierarchical contrasting hocus pocus and what not). So you would have “_” as a category-level delimiter, and then “A_A0”, “A_A1”, … “B_B0”, “B_B1”, etc.

In your json_schema you define:

...
"category": {
    "type": "string",
    "enum": ["A_A0", "A_A1", ..., "B_B3", "Other"]
},
...

Then you do the category level separation post-response.
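
The flattening and the post-response separation could look roughly like this. It is a minimal sketch: HIERARCHY, flat_labels and split_category are hypothetical names, and it assumes the delimiter never appears inside a primary category name.

```python
# Assumed mapping of primary category -> secondary categories.
HIERARCHY = {
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
}


def flat_labels(hierarchy, delimiter="_", fallback="Other"):
    """Build the single flat enum, e.g. ['A_A0', ..., 'B_B3', 'Other']."""
    labels = [f"{primary}{delimiter}{secondary}"
              for primary, secondaries in hierarchy.items()
              for secondary in secondaries]
    return labels + [fallback]


def split_category(label, delimiter="_"):
    """Split 'A_A0' into ('A', 'A0'); a label without the delimiter gets None."""
    primary, sep, secondary = label.partition(delimiter)
    return primary, (secondary if sep else None)


print(flat_labels(HIERARCHY)[:2])   # ['A_A0', 'A_A1']
print(split_category("B_B3"))       # ('B', 'B3')
print(split_category("Other"))      # ('Other', None)
```

Because every enum value already names a valid pair, an invalid combination such as “A” with “B0” cannot be produced with this layout.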

Great job! That’s a cool example. You have some great insight into how to use Structured Outputs, considering it has only been out a week.

5 stars, mate.
:star: :star: :star: :star: :star:
