Different results: ChatGPT 3.5 vs API (gpt-3.5-turbo)

The web chat interface has a hidden initial prompt that sets it up as a chatbot. The API does not. At the very least, you need such a prompt to get something that resembles the chat interface.

I am including initial prompts. They often seem to be ignored, and the differences seem quite vast as well. Has anyone seen any “hidden prompt” baselines?

Have you had success with this?

The API results are similar to the playground. Do you have any examples of prompting that have been able to close the gap?

I ran ChatGPT’s system prompt on the API, with my user prompt that says “no thanks to your system prompt”. I then also ran it multiple times on ChatGPT as the first input.

  • API at temperature 0.5, top_p 0.5
  • ChatGPT with its rather high temperature (even producing bad grammar in uncommon languages)

Now: which ones are from the API?

There are hints, with even more runs, that there is a different ChatGPT model. I didn’t use functions to probe for that differently-trained model, because the function itself could affect the output more.
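For reference, the API side of that comparison was pinned down with explicit sampling settings, roughly like this (a sketch using the pre-1.0 openai Python package; the exact prompt text here is illustrative, not copied from my runs):

from datetime import datetime
import openai

# ChatGPT-style system message (see the exported prompt later in this thread)
system_prompt = f"""You are ChatGPT, a large language model trained by OpenAI, based on the GPT-3.5 architecture.
Knowledge cutoff: 2021-09
Current date: {datetime.now().date()}"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "No thanks to your system prompt."},
    ],
    temperature=0.5,
    top_p=0.5,
)
print(response["choices"][0]["message"]["content"])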

You will never get exactly the same results. There’s a seeding going on that will alter things every time you submit a prompt.

Not “seeding”. The variation is softmax sampling of probabilities from the certainties the model predicts as it produces each next word or token, with unknown temperature and top_p nucleus sampling techniques then applied to the logits.

If, in an open-ended text generation, the predicted next word in the sequence “Seeding in an LLM? Sounds like bull-” were “puckey” with a certainty of 5% (and maybe other words are higher), then over millions of generations you would see “puckey” appear in the text in a distribution similar to the sampling probability: in about 5% of the cases.

And that’s how it’s designed, as machine-like “best” text is not as “human”. As I detailed, I applied nucleus sampling with enough distributional probability allowed that the API generations under my control weren’t all identical texts, but constrained enough that the “intentions” of the AI weren’t overly misdirected by unexpected sampling.

However, you can call it what you like, randomness, creativity, unpredictability, instability…
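To make the mechanism concrete, here is a toy version of that sampling step (pure Python of my own; real implementations operate on tensors of logits over the whole vocabulary, not little dicts):

import math, random

def sample_token(logits, temperature=1.0, top_p=1.0):
    # Softmax with temperature: divide logits by T, exponentiate, normalize.
    scaled = {tok: v / temperature for tok, v in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = sorted(((tok, math.exp(v) / z) for tok, v in scaled.items()),
                   key=lambda kv: -kv[1])
    # Nucleus (top_p) sampling: keep the smallest head of the ranked
    # distribution whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize the nucleus and draw one token.
    r = random.random() * sum(p for _, p in kept)
    for tok, p in kept:
        r -= p
        if r <= 0.0:
            return tok
    return kept[-1][0]

# “puckey” carries about 5% of the probability mass here:
logits = {"droppings": 2.5, "feathers": 2.0, "puckey": 0.0}
draws = [sample_token(logits) for _ in range(100_000)]
print(draws.count("puckey") / len(draws))   # ~0.05
print(sample_token(logits, top_p=0.5))      # the nucleus drops the 5% tail: always "droppings"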

Bumping this: I get the exact same problem as everyone else, sometimes in a dramatic way, using the API for tasks that really shouldn’t be too hard for gpt-3.5-turbo (at least the web chat version manages them quite well).

For context, my use case is to classify whether a short summary of text (~100 tokens) talks about a certain topic or not (e.g. science). I use gpt-3.5-turbo in the API and GPT-3.5 in the web chat. Additionally, in my prompt I specifically instruct that I expect nothing but a yes/no answer.
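Concretely, each call looks roughly like this (a simplified sketch of my setup using the pre-1.0 openai Python package; the prompt wording is abbreviated here):

import openai

def talks_about(summary: str, topic: str = "science") -> str:
    # Ask for a bare yes/no classification of a ~100-token summary.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Does the following text talk about {topic}? "
                f"Answer with nothing but yes or no.\n\n{summary}"
            ),
        }],
        temperature=0,  # keep the classifier as repeatable as possible
    )
    return response["choices"][0]["message"]["content"].strip().lower()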

One weird little thing I’ve noticed so far (though I might need to experiment a bit more) is that while the API doesn’t consistently answer correctly, it does obey the instruction about the expected form of the answer (it always returns a “yes” or “no”). In the web chat, by contrast, the instruction about the yes/no form may not always be followed, but the answer has always been correct so far.

Any help or explanation would be greatly appreciated here. As has already been stated above, I really see no reason not to be able to consistently replicate results from the web app with the API, even if you have to play a bit with the parameters, if they really are the same models used in the backend.
(And yes, I know randomness is involved at each invocation of the model, so I’m not talking about exactly replicating results, but at least, on such basic tasks, being able to observe the same precision rates between the two versions.)


ChatGPT is now using a gpt-3.5 model that they report in the model list as having 8k context. That’s something not available via the API.

OpenAI also uses ChatGPT as more of a testing ground (seen more with GPT-4 recently and its changing knowledge), so the particular performance you see, and the “which is better” tests that come with such evaluations, may reflect a model that is not universal to all users and not on the API.

OpenAI clearly seems to consider gpt-3.5-turbo a “ChatGPT” model, and the only way it now performs well, rather than degraded from what it once was, is to put a ChatGPT-style system message in and let the user do whatever they want.

Finally, while you can constrain the “creativity” in the API, ChatGPT has a rather high unknown temperature, to the point of grammatical errors in less common world languages. You won’t be able to compare its output even to itself.


Thank you! I did just that and got much better results from the API (adding the system prompt used in the web app).


Would you please share this prompt? Or give a code example. Thank you!

You can retrieve it in the JSON files that are available for download in the web app > Settings > Data Controls > Export Data.
But basically it’s this:

from datetime import datetime

SYSTEM_PROMPT = f"""You are ChatGPT, a large language model
trained by OpenAI, based on the GPT-3.5 architecture.
Knowledge cutoff: 2021-09
Current date: {datetime.now().date()}"""

And then this goes into your messages for your request like so:

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": your_prompt},
]
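
And then the request itself (shown with the pre-1.0 openai Python package; adjust to whatever client version you are on):

import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,  # the list built above, system prompt first
)
print(response["choices"][0]["message"]["content"])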

I guess ChatGPT has been trained with this system prompt, so it kind of “needs” it to activate a lot of its knowledge/functionalities. Or at least something similar, if you really want to change it.


Thanks a lot! Tomorrow I will try testing it with the prompts that weren’t working.


Same issue here. The API returns extremely poor and inconsistent responses compared to Chat, no matter how clear your instructions are, which in fact makes the API unusable. Hope the team becomes aware of this one day.

Can you give an example prompt that produces an acceptable result in ChatGPT but not in the API? It would be very helpful to see this.

It takes place very often though. Let me show you an example.

Here is what was provided by the API (which is absolutely nonsense):


Compared to what ChatGPT showed me:

It is very disappointing, as we pay more for the API but get a much poorer result!

The response you see from the playground is exactly what someone programming the API as a classifier would want to see.

You do not show the playground system prompt in your screenshot. You must tell the AI that it is a chatty companion that explains things, or use a triggering phrase for that behavior, such as “You are ChatGPT, a helpful AI assistant chatbot”.
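For example (the wording here is only an illustration of the idea, not ChatGPT’s exact prompt):

messages = [
    {"role": "system", "content": "You are ChatGPT, a helpful AI assistant "
                                  "chatbot. Explain your answers conversationally."},
    {"role": "user", "content": "Is this summary about science? ..."},
]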

You tried it?

I added the chatty part you suggested, but it made no difference when I did!

Let’s go for the same results:

Same style, different answer.

So let’s see if we can improve it with a different system message programming the AI.

Got it. Thank you!

I think it may work in the future!