API won't fully adhere to system prompt with structured output

So I am using the chat completions API with gpt-4o. My code, in abbreviated form, looks something like this:

    import json

    from openai import OpenAI
    from pydantic import BaseModel

    client = OpenAI()

    SUMMARY_LENGTH = 1000

    class Interval(BaseModel):
        index: int
        start: str
        end: str
        summaries: str

    class Intervals(BaseModel):
        intervals: list[Interval]

    generate_summary_message = [
        {
            "content": f"The user will send you a list of time intervals, where each interval consists of an index number, a start field, an end field, and a list of associated news articles. For each interval, create a summary of the news articles with at least {SUMMARY_LENGTH - 100} and at most {SUMMARY_LENGTH + 100} characters. Format your response as JSON as per the specified response format.",
            "role": "system"
        },
        {
            # news_infos holds the intervals and their articles; unescape for readability
            "content": json.dumps(news_infos, ensure_ascii=False).replace("\\n", "\n").replace('\\"', '"'),
            "role": "user"
        }
    ]
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=generate_summary_message,
        response_format=Intervals,
        max_tokens=16000,
    )
    ai_response = Intervals.model_validate_json(response.choices[0].message.content).intervals

This sort of works, BUT the API absolutely will not adhere to the character limit I specify in the system prompt. Likewise, if I make further stipulations in the system prompt like “only generate a summary if the news articles for the given interval are meaningfully different from the news articles for previous intervals”, that, too, will be ignored. I guess the second requirement might be a bit too complicated for a non-CoT model, but surely adhering to a specified length for the summaries shouldn’t be a problem?

Hi!

I’m afraid this might be more complicated than you think. The models are absolutely terrible at counting, and they have no practical way of counting characters as they generate text.

One analogy would be asking you how many characters are in your post. Could you answer that at a glance, without resorting to an iterative approach? Word count doesn’t work exactly either, but it might get you in the general ballpark.

I would suggest a tool-assisted, iterative approach. Tell it to generate a summary against a qualitative target, using a simple number that can be encoded as a single token (50 words, 200 words, 1 paragraph, 2 paragraphs; see https://platform.openai.com/tokenizer). Then count the characters in code, and hand the summary back to the model, telling it to trim by x% or some proportion.
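A minimal sketch of what I mean, reusing the client and SUMMARY_LENGTH from your snippet (the prompt wording and the retry loop are just illustrations, not a tested recipe):

    from openai import OpenAI

    client = OpenAI()
    SUMMARY_LENGTH = 1000

    def summarize_with_length_check(articles_text: str, max_rounds: int = 3) -> str:
        # First pass: give the model a target it can plausibly hit (words, not characters).
        messages = [
            {"role": "system", "content": "Summarize the user's news articles in about 150 words."},
            {"role": "user", "content": articles_text},
        ]
        response = client.chat.completions.create(model="gpt-4o", messages=messages)
        summary = response.choices[0].message.content

        # Iterative correction: do the counting in code, then ask the model to adjust.
        for _ in range(max_rounds):
            length = len(summary)
            if SUMMARY_LENGTH - 100 <= length <= SUMMARY_LENGTH + 100:
                break
            direction = "shorten" if length > SUMMARY_LENGTH else "lengthen"
            pct = abs(length - SUMMARY_LENGTH) * 100 // SUMMARY_LENGTH
            messages += [
                {"role": "assistant", "content": summary},
                {"role": "user", "content": f"That summary is {length} characters. Please {direction} it by roughly {pct}% while keeping the key facts."},
            ]
            response = client.chat.completions.create(model="gpt-4o", messages=messages)
            summary = response.choices[0].message.content
        return summary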

This one’s actually more achievable, I think. Split your summaries str into an array of

    class Summary(BaseModel):
        cot: str
        summary: str | None

and tell the model to reflect in the cot field on which article it’s about to process and whether a similar article has been processed. If a similar article has been processed already, summary should be set to null.
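Plugged into your original models, that could look something like this (the exact prompt wording here is just a guess at something workable):

    from pydantic import BaseModel

    class Summary(BaseModel):
        cot: str             # the model "thinks out loud" here before committing
        summary: str | None  # null when the interval adds nothing new

    class Interval(BaseModel):
        index: int
        start: str
        end: str
        summaries: list[Summary]  # replaces the plain summaries: str

    class Intervals(BaseModel):
        intervals: list[Interval]

    # Appended to the system prompt:
    dedupe_instruction = (
        "For each interval, first fill in cot: note which articles you are about to "
        "summarize and whether a similar article already appeared in an earlier interval. "
        "If it did, set summary to null; otherwise write the summary."
    )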

Hope this gives you some ideas you can get started with!


That’s fascinating. I had assumed I must be doing something wrong, because adhering to a specified length sounds trivially easy compared to, say, producing a good summary, or any of the endless other things the API can do pretty well.

I haven’t tried your proposed solution yet, but as you said, the word count might be off in much the same way as the character count. If it is, I would have to resubmit both the articles that were the basis for the summary and the summary itself, adding time and token usage to what is already a rather long pipeline. Maybe I’ll just settle for the current summaries, which are somewhat unpredictable in length, though they generally err on the side of being too short.


I’d say you don’t necessarily have to re-send the whole prompt; asking the model to rewrite the existing summary (assuming it overshot) might work decently well.
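A rough sketch of such a follow-up call, assuming the client from the original post (the prompt is just an illustration):

    from openai import OpenAI

    client = OpenAI()

    def rewrite_summary(summary: str, target_chars: int) -> str:
        # Re-send only the summary itself, not the articles it was built from.
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"Rewrite the user's summary to roughly {target_chars} characters. Keep the key facts; cut detail, not meaning.",
                },
                {"role": "user", "content": summary},
            ],
        )
        return response.choices[0].message.content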

Yeah, I guess there are a lot of misconceptions around this. A good litmus test of whether a model can do something is whether you can do it at a single glance, without stepping through it in your head, keeping a count, and so on. CoT models (R1, o1, o3) are more capable here right out of the box, because they have been trained to iteratively aggregate things so they’re more ‘glanceable’.

The one superpower LLMs have is that they have more ‘eyes’ than you, so they can look at multiple things at the same time; but they can’t look at 100 things, and they often aren’t looking at what you think they are.
