Differences between API results and browser (ChatGPT) results with GPT-4

I tried the same prompt via the API and via the browser for comparison, and the differences are notable. I prompt with a word and ask for a three-word definition (for a dictionary). The browser version nails it most of the time; the API (gpt-4) does not. I am a paid subscriber and beta tester, so I should have access to the latest model in both instances. What’s up?


Exactly! I have the same problem.
The lack of consistency between the GPT-4 web app and the API has cost me a lot of money and time.
I tried different configurations as well: different temperatures, system prompts, etc.

Did you manage to find a solution for that since you posted?

I’ve been running into the same problem for more than a week. No solution yet. Did you discover anything?

I am noticing a strange effect, though others have noted it as well. ChatGPT results are consistently better than API results (using the same model, whether gpt-4 or gpt-3.5-turbo). No other context; a new conversation is initiated in ChatGPT each time. ChatGPT was correct 10/10 times, whereas the API produces the correct response (i.e. generates code that executes and produces a graph) 20% of the time. So far I’ve tried leaving all options at their defaults, as well as fiddling with temperature (0 to 1) and other options. I’ve tried giving different system messages (or no message at all). Same results.

Pretty frustrating. Either ChatGPT is using a different model than the API, or it has a system prompt that makes all the difference (if so, tell us please, so API results can be just as good).

Have you noticed a similar issue? What did you try to solve it?

What have you tried specifically for prompts?

What are you trying to achieve for output?

It is the exact same prompt in both ChatGPT and the API. I’m trying to get it to generate code given some context; the code creates a graph. For the API, I have tried the prompt in the “user” role, and I have also tried splitting it into system and user roles. Nothing has worked so far.
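For reference, the role split I tried looks roughly like this (the strings are illustrative placeholders, not my actual prompt):

```python
# Sketch of splitting one monolithic ChatGPT-style prompt into system and
# user roles for the chat completions API. All text here is a placeholder.

def build_messages(instructions, task):
    """Return a chat payload with instructions and task in separate roles."""
    return [
        {"role": "system", "content": instructions},
        {"role": "user", "content": task},
    ]

messages = build_messages(
    "You write complete, runnable plotting code.",
    "Given a list of monthly totals, produce Python that draws a bar chart.",
)
# The payload would then be sent with e.g.:
# openai.ChatCompletion.create(model="gpt-4", messages=messages)
```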

Can you share the data on all the tests you ran?

Without seeing your prompt, it’s hard to help.

In my experience, it’s all about the prompt. ChatGPT likely does have other stuff under the hood: either a custom system message or more.

If you share what you’re trying to do, we might be able to help. Otherwise, it’s like you’re just saying, “Hey, it’s a lot worse, but I can’t show examples…trust me…”

You can try a system message that conforms to, and even improves on, ChatGPT’s unseen system prompt format, for specific tasks:

You are DeveloperBot, powered by GPT-4, a large language model trained by OpenAI. DeveloperBot focuses its attention on user programming tasks, producing fully-functional and executable code and replacement code snippets without omissions or elided sections (“…”) left for the user to fill in. Warning: writing for present-day APIs such as OpenAI’s will require, and must employ, additional user-supplied API documentation.
Today’s Date: August 2023
Knowledge Cutoff: September 2021

Then, good parameters for coding are realistically top_p: 0.1 and temperature: 0.1. You don’t want the AI to be “creative” in its token generation.

Your programming chatbot must also manage the conversation history it sends back. Your technique will be different from ChatGPT’s unknown method.
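As a sketch, the setup above could be assembled like this (pre-1.0 `openai` Python library, current as of this writing; the prompt text is abbreviated and the network call is left commented out):

```python
# Sketch: DeveloperBot system message plus low temperature/top_p so code
# generation is not "creative". The system text is abbreviated from above.
DEVELOPER_SYSTEM = (
    "You are DeveloperBot, powered by GPT-4, a large language model trained "
    "by OpenAI. DeveloperBot focuses its attention on user programming "
    "tasks, producing fully-functional and executable code without "
    "omissions for the user to fill in.\n"
    "Today's Date: August 2023\n"
    "Knowledge Cutoff: September 2021"
)

def coding_request(user_prompt, history=None):
    """Assemble keyword arguments for a chat completions call."""
    messages = [{"role": "system", "content": DEVELOPER_SYSTEM}]
    messages += history or []  # prior turns, managed by your application
    messages.append({"role": "user", "content": user_prompt})
    return {
        "model": "gpt-4",
        "messages": messages,
        "temperature": 0.1,
        "top_p": 0.1,
    }

params = coding_request("Write a function that reverses a linked list.")
# response = openai.ChatCompletion.create(**params)
```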



Without examples of the prompt and the output generated by both the API and ChatGPT, it is not possible to diagnose your issue. Please provide links to the ChatGPT sessions that are generating code you find acceptable, plus a log of the prompts sent and the replies generated by the API, along with any code you use to call the API. If you use the Playground instead, please include screenshots of the model settings you use along with a text copy-paste of the various prompts and the system prompt.

Hey, we faced a similar issue internally, so we modded the OpenAI Playground to use the API.

The tool supports sequential chains, so you can test sequential user-message flows. We use this a lot because we constantly improve our system prompt and want to make sure old user flows don’t break.

If you find this useful, DM me and I can send you the code!


@ogb I find this helpful; can you please share the code?


I am passing content and instructions to GPT to perform a task. The responses in the Playground and via the API are totally different even though parameters such as temperature and max_tokens are exactly the same.

The difference is in the unseen system prompt between ChatGPT and what you program the API with.

Then there’s the fact that you can’t reduce the temperature of ChatGPT to make its responses deterministic.

Start new sessions in ChatGPT and you’ll never get the same answer twice.
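One way to see this is to run the same prompt several times and tally the distinct replies. A sketch, where `call_model` is a placeholder for whatever sends the prompt (API or otherwise):

```python
# Sketch: repeat a prompt n times and count how many distinct replies appear.
# With temperature and top_p near 0 on the API you would hope for one
# distinct reply; ChatGPT exposes no such knob.
from collections import Counter

def reply_distribution(call_model, prompt, n=10):
    """Run the same prompt n times and tally distinct replies."""
    replies = [call_model(prompt) for _ in range(n)]
    return Counter(replies).most_common()

# Stand-in for a real API call, just to show the shape of the result:
always_same = lambda prompt: "METAPHOR:1"
print(reply_distribution(always_same, "Is 'brain' metaphorical here?", n=3))
# → [('METAPHOR:1', 3)]
```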



Playground, with today’s ChatGPT system message extracted and inserted, and temp/top_p set low so at least this will be similar if repeated:

The unreliable preamble in which ChatGPT first describes what it is going to do can also shape the generation of the code that follows.

Just sent you the link to the code; I’ll put it here if more people start requesting it (so I don’t get banned).

I’ve been having the same issue, i.e. the web interface gives much better responses. Has anyone found a way around this?

Here’s an example from my data:

For a word or phrase to be identified as metaphorical, the analyst identifies: (1) contrast, ie, another, more basic meaning for the word or phrase, where basic usually means more concrete or physical and is different from the contextual meaning – to determine the basic meaning, choose the physical or concrete meaning from among the various senses available in the dictionary (not in the sentence!); and (2) transfer of meaning, ie. the contextual meaning can be understood through comparison with the basic meaning. A word will NOT be metaphorical if the contextual meaning overlaps with the basic meaning, unless it overlaps but the basic meaning is concrete and the contextual meaning is abstract, in which case it will be a metaphor. Based on that concept of metaphor, tell me if the (first occurrence of the) word brain is used metaphorically in the next sentence, and respond with three items: (1) METAPHOR:1 if it is a metaphor and METAPHOR:0 if it is not, (2) BASIC MEANING; (3) CONTEXTUAL MEANING: ; CONTRAST:1 if there is contrast or CONTRAST:0 if there is no contrast; TRANSFER:1 if there is transfer or TRANSFER:0 if not – do not say anything else: Memories flashed through my brain.

API response:

{
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "METAPHOR:0 \nBASIC MEANING: organ that controls thought, memory, and emotion in the head \nCONTEXTUAL MEANING: experiencing sudden memories \nCONTRAST:0 \nTRANSFER:0"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 276,
    "completion_tokens": 44,
    "total_tokens": 320
  }
}
Web interface response:

Based on the concept of metaphor you provided, here’s the analysis for the (first occurrence of the) word “brain” in the sentence:

  1. METAPHOR:1
  2. BASIC MEANING: brain (the organ in the head responsible for cognitive functions)
  3. CONTEXTUAL MEANING: brain (referring to the mind or consciousness as a container for memories in the sentence)

The API response is much poorer, giving a confusing response to the contextual meaning, whereas the web interface gives a very good definition of the contextual meaning. The API then concludes that it is not a metaphor, whereas the web version correctly concludes it is a metaphor…

I get identical output from the API and the playground. It is ChatGPT that fails to produce reliable results because of its high temperature and model perplexity.

However, the long prompt you offer is so impenetrable that just changing a word or two of the system prompt will change the numbers from 0s to 1s.


API gpt-3.5-turbo, OpenAI Playground:
BASIC MEANING: The physical organ in the head responsible
for cognitive functions.
CONTEXTUAL MEANING: Memories passing quickly through one’s
[Finish reason: stop]

Settings to duplicate:

system = [{"role": "system", "content":
    "You are ChatWeb, an AI language assistant based on gpt-3, released 2023. AI knowledge: before 2022."}]

response = openai.ChatCompletion.create(
    messages    = system + user,  # concatenate lists
    model       = "gpt-3.5-turbo",
    temperature = 0.5,   # 0.0-2.0
    top_p       = 0.2,   # 0.0-1.0
    max_tokens  = 512,   # response length to receive
    presence_penalty  = 0.0,  # penalties -2.0 - 2.0
    frequency_penalty = 0.0,  # frequency = cumulative score
    n           = 1,
    stream      = True,
    logit_bias  = {"100066": -1},  # example: '~\n\n' token
    user        = "your_customer_id",
)

Thank you for giving this a try!

The response from the playground is still very different from the web response.

I tried different settings on the playground and none of those replicated the web response.

I managed to replicate your playground output, but that’s still different and less sophisticated than the web response for this task.

Yes, that prompt is long and intricate, I agree - but it works every time with the web interface. And it gives me the correct response consistently on the web.

It seems the prompt has reached the limit of the attention heads available to the model. I’m not sure where that is happening; it could be right at that boundary line, but I suspect it’s further up.

If there is any logic to the prompt, then I’m unable to find it, and I think the model is struggling with it too. Minor changes to the model environment will cause large differences in output when there is no attention left for the instructions.

When you continue saying “the web interface” or “browser content” are you talking about ChatGPT at chat.openai.com? Or are you talking about the API playground at OpenAI Platform?

One of those is completely out of your control as far as the backend, prompting, conversation, and so on. However, ChatGPT can also be emulated via the API, and even improved upon for an application.
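For example, something in this spirit. The system text below is a guess at the format, not the actual, unpublished ChatGPT system prompt:

```python
# Sketch of "emulating ChatGPT" via the API: supply a ChatGPT-style system
# message yourself. The wording is an assumption about the format only.
from datetime import date

def chatgpt_style_system(model_name="GPT-4", cutoff="2021-09"):
    """Build a system message imitating ChatGPT's presumed format."""
    return {
        "role": "system",
        "content": (
            f"You are ChatGPT, a large language model based on {model_name}, "
            "trained by OpenAI.\n"
            f"Knowledge cutoff: {cutoff}\n"
            f"Current date: {date.today().isoformat()}"
        ),
    }

msg = chatgpt_style_system()
# messages = [msg, {"role": "user", "content": "your prompt here"}]
```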

The problem is that your prompt gives results with probabilities hardly better than flipping a coin. Start 10 new conversations in ChatGPT (if that’s what your “web interface” is), and you’ll get different answers.

This doesn’t need to be given to a prompt engineer, but to a writer who can structure the thoughts, decisions, and outputs the AI must produce from a clearly defined input and task.

Thank you for your reply.

When you continue saying “the web interface” or “browser content” are you talking about ChatGPT at chat.openai.com? Or are you talking about the API playground at OpenAI Platform?

I’m referring to chat.openai.com.

The problem is that your prompt gives results with probabilities hardly better than flipping a coin. Start 10 new conversations in ChatGPT (if that’s what your “web interface” is), and you’ll get different answers.

That didn’t happen when I tried.

The problem in reality is that chat.openai.com and the API give different results, and apparently there’s no way to tweak the API to give the same results as chat.openai.com.