Why are gpt-4-preview models giving me subpar performance? Please advise

When I ask a question with the following parameters:

  1. A system prompt that includes instructions on how to answer (a persona)
  2. A user prompt = question + retrieved nodes (docs with the data needed to answer the question), roughly as sketched below
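
In other words, the request looks roughly like this (a simplified sketch; the persona text, question, and retrieved docs are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder persona and retrieved context, just to illustrate the structure
persona = "You are a meticulous financial analyst. Answer concisely in bullet points."
retrieved_nodes = "\n\n".join(["<doc 1 text>", "<doc 2 text>"])
question = "What was the revenue growth in Q3?"

response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": f"{question}\n\nContext:\n{retrieved_nodes}"},
    ],
)
print(response.choices[0].message.content)
```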

The AI is not listening to the persona in the system prompt with the following models:

  1. gpt-4-turbo-preview
  2. gpt-4-1106-preview
  3. gpt-4-0125-preview

I saw this trend across all of the preview models (the latest models with 128K context size).

But the AI honors my persona very well in the older models, with the exact same prompt combination:

  1. gpt-4-0613
  2. gpt-4
  3. gpt-3.5-turbo-16k-0613

Is it possible that preview models are not good at honoring the system prompt?

Welcome to the community!

What you’re seeing seems to be in line with what is commonly observed.

  1. The gpt-4 models (0314, 0613) are the actual GPT-4 models. They’re stronger in terms of reasoning and understanding, but more expensive.

    • strengths:
      • instruction following
    • weakness:
      • more prone to hallucinations
  2. The gpt-4-turbo models (1106, 0125) don’t seem to be actual GPT-4 models. “Turbo” means they’re faster and cheaper, but also a little less capable in some regards. They appear to have a wholly different architecture compared to gpt-4, and I wouldn’t consider them an upgrade; they’re something different.

    • strengths:
      • slightly less prone to hallucinations
      • strong adherence to markdown
      • very predictable response format
    • weaknesses:
      • more opinionated
      • worse instruction following

They both have their pros and cons.

Now the system prompt, well, it’s a curious thing.

In all the demos and docs you’ll typically find the system prompt tacked to the front of the conversation.

However, the bigger your document gets, the less relevant your system prompt will become, especially if you tacked it onto the beginning of the conversation. The absolute best way to ensure the model follows your instructions (in my experience) is to tack either a system message or a user message to the very bottom of the conversation telling the model what to do or how to behave.
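
For example, something along these lines (a minimal sketch, assuming the standard chat completions endpoint; the persona and reminder text are just illustrations):

```python
from openai import OpenAI

client = OpenAI()

persona = "You are a meticulous financial analyst. Answer concisely in bullet points."

messages = [
    {"role": "system", "content": persona},
    # ... long conversation and/or retrieved documents in between ...
    {"role": "user", "content": "What was the revenue growth in Q3?\n\nContext:\n<retrieved docs>"},
    # Reminder tacked onto the very bottom of the conversation
    {"role": "system", "content": f"Reminder: {persona}"},
]

response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=messages,
)
```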


Interesting. I am aware that you can string together multiple ‘user’ and ‘assistant’ messages and pass them, but can you also put in multiple ‘system’ messages? If so, that’s a game changer.

sure!

most of these prompt abstractions are just made up anyway and have no actual programmatic foundation, so there’s a lot of stuff you can do that nobody ever intended.

OpenAI is just putting validators on some stuff but as long as you can get past those you can do whatever.
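
For instance, the chat completions endpoint will happily accept something like this (a hypothetical sketch; how much weight the model gives the later system message is model-dependent):

```python
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a pirate. Answer in pirate speak."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Arr, that be Paris, matey."},
    # A second system message mid-conversation; the API doesn't reject this
    {"role": "system", "content": "From now on, answer in exactly one word."},
    {"role": "user", "content": "What's the capital of Spain?"},
]

response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=messages,
)
```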


I tried it out, but observing it in Helicone, it looks like the order of my user and assistant messages is preserved while all system messages get pushed to the beginning (though still in their original order). So it seems we can’t “refresh” the system prompt later in the conversation while keeping the message history in front of it. 🙁

Maybe not what you are looking for, but it seems this is possible using runs in the Assistants API. The instructions parameter overrides the assistant’s instructions for that run (or response, if you prefer), while additional_instructions (not sure that wording is exactly right) appends to the pre-existing instructions.

This would imply that the instructions (akin to the system message) apply at the response level, though it’s unclear where exactly they’re placed in the order of messages. It would also require using the Assistants API, which might not suit your needs.
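
For reference, a rough sketch against the beta Assistants API (the IDs and instruction text are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# `instructions` replaces the assistant's default instructions for this run only
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
    instructions="You are a meticulous financial analyst. Answer concisely in bullet points.",
)

# `additional_instructions` instead appends to the assistant's existing instructions
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
    additional_instructions="Always cite which document a figure comes from.",
)
```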