Responses API ignoring system message

Hey everyone,

I’m using gpt-4o with the new Responses API, with tool_choice set to web_search_preview, and I’m running into an issue where the model does not consistently follow the system message across turns.

Expected Behavior:

  1. I set a system message:
  • "You must only classify companies by industry in two words. Do not answer anything else."
  2. I ask:
  • "What industry is Tesla in?"
  • The model correctly responds: "Electric Vehicles"
  3. Then, I ask:
  • "Where is Tesla's headquarters?"
  • Unexpected behavior → instead of refusing, the model answers the question normally.
  • Any unrelated question that requires a web search gets answered, regardless of the system message.

Issue:

  • Even though the first response follows the system instruction [Note: this also requires a web search], subsequent queries are treated as normal questions instead of sticking to the original rule.
  • I expected the model to refuse to answer unrelated questions, since the system message says it should only classify industries.
  • Even reinforcing the instruction in the user prompt doesn’t work.

What I’ve Tried:

  • Repeating the system message inside each user query.
  • Lowering temperature and top_p to make responses more deterministic.

None of these seem to work consistently.
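
For context, here's roughly how my calls are structured. This is a minimal sketch using the OpenAI Python SDK, and the way I chain turns with previous_response_id is a simplification of my actual code:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You must only classify companies by industry in two words. "
    "Do not answer anything else."
)

# Turn 1: an on-task classification question.
first = client.responses.create(
    model="gpt-4o",
    instructions=SYSTEM,                          # system/developer message
    tools=[{"type": "web_search_preview"}],
    tool_choice={"type": "web_search_preview"},   # force the web search tool
    input="What industry is Tesla in?",
)
print(first.output_text)   # "Electric Vehicles" (follows the rule)

# Turn 2: an unrelated question, continuing the same conversation.
second = client.responses.create(
    model="gpt-4o",
    instructions=SYSTEM,                          # repeated every turn
    tools=[{"type": "web_search_preview"}],
    tool_choice={"type": "web_search_preview"},
    previous_response_id=first.id,                # carry the prior turn as context
    input="Where is Tesla's headquarters?",
)
print(second.output_text)  # answers normally instead of refusing
```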

Question:

Is there a way to force the model to strictly follow the system message even when it performs a web search?

Any help would be appreciated!

It seems that the way the instructions are given in the system message might not be optimal. Please try the following system message:

Your task is to classify companies into industry categories using only two words.
If a user says something that does not match the above task, simply reply, "Please ask only about classifying companies into industry categories."

User: What industry is Tesla in?

Assistant: Automotive Technology

User: Where is Tesla’s headquarters?

Assistant: Please ask only about classifying companies into industry categories.

User: What industry is Galileo Learning in?

Assistant: Education Services

This should work well!
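
For example, here is a minimal sketch of passing this via the instructions parameter of the Responses API with the Python SDK (the tool settings are just my assumption about your setup):

```python
from openai import OpenAI

client = OpenAI()

INSTRUCTIONS = (
    "Your task is to classify companies into industry categories using only two words.\n"
    "If a user says something that does not match the above task, simply reply, "
    '"Please ask only about classifying companies into industry categories."'
)

response = client.responses.create(
    model="gpt-4o",
    instructions=INSTRUCTIONS,                # system/developer message
    tools=[{"type": "web_search_preview"}],   # web search stays available
    input="Where is Tesla's headquarters?",
)
print(response.output_text)  # expected: the refusal sentence from the instructions
```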


One tip from attempting to reproduce this: I wrote a system message for a similar, not especially rigorous application, and it was the opposite of ignored; my system message policy actually denied the user-side message's request for a web search.

Here’s what I found: with this one-shot example classifier, it was just about impossible to get web search invoked by the system message alone, even when making it mandatory and justifying its use. It took a user message along the lines of `classify with web search: "{text}"` to trigger it.

Once web search was invoked, though, the output format that otherwise could not be broken got broken.

If the system message really were being ignored or dropped later on, including after a search, then in later turns I would not have been able to probe gpt-4o with deliberately intrusive diagnostic questions and still see the system message reflected in its answers.

I think what you might be running into is:

  • the sheer volume of injected web search results distracting from the question,
  • the injected instructions about not repeating search results verbatim stealing focus from the task, and
  • few-shot learning from earlier turns making the bad behavior persist.

There is no reason to build a “conversational” classifier that carries the actual API call history. Past user inputs, divergent responses, and a growing chat can degrade quality, whereas any multi-shot context should be improving quality with your own curated examples.
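
As a rough illustration (a sketch assuming the OpenAI Python SDK; the prompt wording and helper function are mine, not yours), each company can be classified with an independent, stateless call:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "Classify the given company into an industry category using exactly two words. "
    "Output only those two words."
)

def classify(company: str) -> str:
    """One stateless call per company: no accumulated chat history to drift from."""
    response = client.responses.create(
        model="gpt-4o",
        instructions=SYSTEM,
        tools=[{"type": "web_search_preview"}],   # available if the model needs it
        input=f'classify with web search: "{company}"',
    )
    return response.output_text.strip()

for name in ["Tesla", "Galileo Learning"]:
    print(name, "->", classify(name))
```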

Thanks for the suggestion! I actually got this to work when the “Tool choice” was set to auto, but as soon as I switched to web_search_preview, it stopped working. Could you try it on your end and see if you get the same result?

My main goal isn’t just industry classification—I’m really just testing whether a system prompt can guide how the model behaves during a web search. For example, I want to see if I can make it only answer questions on a specific topic and refuse everything else.

Let me know what you find!

When I used the built-in web search tool via the Responses endpoint with GPT-4o, I only used the tools parameter and did not set tool_choice. This was in the Playground (where the tool_choice parameter cannot be configured).

On the other hand, when I used the GPT-4o Search Preview model in Chat Completions, I observed the kind of behavior you pointed out: the system message content about web search tool usage was ignored. However, the model did seem to follow other instructions besides the tool usage ones.

Perhaps, when using the GPT-4o Search Preview model, there is a fundamental rule that web search must be performed when responding to user queries.

It is possible that if the system message (developer message) were to include stricter tool usage criteria in all caps, the results might change. However, it is likely that this model is extremely difficult to control regarding tool usage through system messages.
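
For reference, this is roughly the kind of Chat Completions call I tested; a sketch with the Python SDK, where the empty web_search_options and the exact system message are simplified stand-ins:

```python
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-search-preview",
    web_search_options={},   # search is built into this model
    messages=[
        {
            "role": "system",
            "content": "Classify companies into industry categories using only two words.",
        },
        {"role": "user", "content": "Where is Tesla's headquarters?"},
    ],
)
print(completion.choices[0].message.content)
# It still searched and answered, even though other instructions were followed.
```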

Although this is not directly related to the API, when I put the same system message into ChatGPT’s custom instructions, it followed the custom instructions even with “Search” turned ON.
This likely indicates that, even when “Search” is enabled, ChatGPT is still using GPT-4o rather than the search-preview model.