API completions not really matching chat.openai.com GPT-3.5 completions

Hi Everyone,
We’ve developed a prompt that we use through the API for a website application.
When we use this prompt in chat.openai.com with GPT-3.5, we get exactly the kind of high-quality completion we want to see.
But when we send the same prompt through the API, using the gpt-3.5-turbo-16k-0613 model, the replies are not as good: more basic, simpler, and lower quality.
We are experimenting with the temperature setting to see whether that helps the API output match the GPT-3.5 completions.
Is there a way to make sure that the requests we send through the API match up with the GPT-3.5 UI?
We are unsure whether other parameters besides temperature need to be controlled.
Any feedback would be appreciated.


ChatGPT does use a tuned gpt-3.5-turbo model to provide its answers. It is likely your inputs to the API that differ.

First, you should provide a system role message: not one that defines your own chatbot and rules right away, but one that emulates the behavior of ChatGPT.

system: “You are ChatBot, a GPT-3 model assistant by OpenAI.”
(one can use the exact text that ChatGPT uses, but that’s not too original, now)

Then, the commands that a user enters are sent as “user” role messages.

You will find that the prompts you designed to twist a poor chatbot’s mind into obeying will work similarly here.

And then, being the boss of “system”, you can make the AI obey you instead of the user, by migrating your instructions, in plain language, into the system message.

Finally, one needs a conversation history ability, sending some past commands back as user/assistant role messages before the latest user input. Without that, the AI can’t answer a question like “but what if I unscrew the third one?”.

A temperature of 0.5 is a good start - as a professional tool, you likely don’t want output that is too unexpected.
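
For reference, a minimal sketch of the above in Python, assuming the openai library's v1.x client interface; the system text, the example conversation turns, and the model choice are placeholders rather than anything taken from this thread:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    # A system message that emulates ChatGPT-like behavior
    {"role": "system", "content": "You are ChatBot, a GPT-3 model assistant by OpenAI."},
    # Prior turns replayed as conversation history
    {"role": "user", "content": "How do I remove the cover panel?"},
    {"role": "assistant", "content": "Loosen the four corner screws, then lift the panel straight up."},
    # The latest user input, which can now refer back to earlier turns
    {"role": "user", "content": "But what if I unscrew the third one?"},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0.5,  # moderate randomness, as suggested above
)
print(response.choices[0].message.content)

The point is the shape of the messages list: the ChatGPT-like system message first, prior turns replayed as user/assistant pairs, and the newest user input last.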


That’s some good input about tuning the prompt through the API so it is treated more like input from a human user. I’ll ask our dev team to update how the prompt is being passed, and to play with the temperature more, to see if we can get the API to mimic the UI.

One thing we are learning is how to manipulate the temperature and top_p settings:

Temperature Setting

  • The “temperature” parameter affects the randomness of the output. A higher temperature (e.g., 0.8) makes the output more random, while a lower temperature (e.g., 0.2) makes it more deterministic. If this parameter is set differently in your API call than in the web interface you’re comparing against, this could lead to different results.
  • Temperature can be set anywhere between 0.0 and 2.0 for experimentation.
  • The API default is 1 (a combined request sketch follows the top_p section below).

Coding
{
"temperature": 0
}

Top_p Setting

  • top_p is an alternative to sampling with temperature, called nucleus sampling, where the model considers only the tokens comprising the top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • For the API, OpenAI generally recommends altering this or temperature, but not both.
  • Update top_p between 0 and 1.
  • The default is 1.

Coding for top_p of 1
{
"top_p": 1
}
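
To put both settings in context, here is a hedged sketch of a full request using the openai Python library (v1.x client); the messages are placeholders, and per the guidance above you would normally tune temperature or top_p, not both:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatBot, a GPT-3 model assistant by OpenAI."},
        {"role": "user", "content": "Explain our setup steps in plain language."},  # placeholder
    ],
    temperature=0,  # deterministic-leaning output, matching the JSON fragment above
    top_p=1,        # left at its default; adjust temperature OR top_p, not both
)
print(response.choices[0].message.content)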

Another minor thing we’re noticing is that the model choice matters when calling the API.
Consider the other models besides gpt-3.5-turbo-16k-0613 (a snapshot pinned to an older date):

  • gpt-3.5-turbo – Max Tokens 4,096
  • gpt-3.5-turbo-16k – Max Tokens 16,384

Because our API calls seem to come in around 3k tokens, we’ll try gpt-3.5-turbo and see if that helps resolve the mismatch between the web UI and the API.
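
As a rough way to check that assumption, a sketch using the tiktoken library; the prompt text and the 1,000-token completion headroom are placeholders:

import tiktoken

prompt_text = "...your full system + user prompt here..."  # placeholder

# Count how many tokens the prompt occupies under gpt-3.5-turbo's tokenizer.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt_tokens = len(encoding.encode(prompt_text))

# Leave headroom for the completion itself (roughly 1,000 tokens here).
if prompt_tokens + 1000 <= 4096:
    model = "gpt-3.5-turbo"      # 4,096-token context
else:
    model = "gpt-3.5-turbo-16k"  # 16,384-token context

print(prompt_tokens, "prompt tokens ->", model)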

gpt-3.5-turbo is simply an alias that points to gpt-3.5-turbo-0613.

The 16k model has a larger context length, but is billed at twice the price whether you use the extra length or not. The two should not diverge much in quality until you actually extend your inputs well into that extra length.


Got it, thanks for the clarification here on the alias re-pointing. It’s interesting to try to back into what the OpenAI team has built out, but I’m sure some more testing can get us to a comparable state.