API completions not really matching chat.openai.com GPT-3.5 completions

Hi Everyone,
We’ve developed a prompt that we use through the API for a website application.
When we use this prompt in chat.openai.com with GPT-3.5, we get exactly the kind of high-quality completion we want to see.
But when we send the same prompt through the API, using the gpt-3.5-turbo-16k-0613 model, the replies are not as good: more basic, simpler, and lower quality.
We are experimenting with the temperature setting to see whether that helps the API output match the GPT-3.5 completions.
Is there a way to make sure that the requests we send through the API match up with the GPT-3.5 UI?
We are unsure whether other parameters besides temperature need to be controlled.
Any feedback would be appreciated.


ChatGPT does use a tuned gpt-3.5-turbo model to provide its answers. It is likely your inputs to the API that differ.

First, you should provide a system role message: not one that defines your own chatbot and rules right away, but one that emulates the behavior of ChatGPT.

system: “You are ChatBot, a GPT-3 model assistant by OpenAI.”
(one can use the exact text that ChatGPT uses, but that’s not too original, now)

Then, the commands that a user enters are sent as “user” role messages.

You will find that the prompts you designed to twist a poor chatbot’s mind into obeying will work similarly here.

And then, being the boss of “system”, you can make the AI obey you instead of the user, by migrating your instructions, in plain language, into the system message.

Finally, one needs a conversation history ability, sending some past commands back as user/assistant role messages before the latest user input. Without that, the AI can’t answer a question like “but what if I unscrew the third one?”.

A temperature of 0.5 is a good start - as a professional tool, you likely don’t want output that is too unexpected.
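
For reference, a minimal sketch of the above in Python, assuming the openai library's v1.x client interface; the system text, the example conversation turns, and the model choice are placeholders rather than anything taken from this thread:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    # A system message that emulates ChatGPT-like behavior
    {"role": "system", "content": "You are ChatBot, a GPT-3 model assistant by OpenAI."},
    # Prior turns replayed as conversation history
    {"role": "user", "content": "How do I remove the cover panel?"},
    {"role": "assistant", "content": "Loosen the four corner screws, then lift the panel straight up."},
    # The latest user input, which can now refer back to earlier turns
    {"role": "user", "content": "But what if I unscrew the third one?"},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0.5,  # moderate randomness, as suggested above
)
print(response.choices[0].message.content)

The point is the shape of the messages list: the ChatGPT-like system message first, prior turns replayed as user/assistant pairs, and the newest user input last.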


That’s some good input about tuning the prompt through the API so it is treated more like input from a human user. I’ll ask our dev team to update how the prompt is being passed, and to play with the temperature more, to see if we can get the API to mimic the UI.

One thing we are learning is how to manipulate the temperature and top_p settings:

Temperature Setting

  • The “temperature” parameter affects the randomness of the output. A higher temperature (e.g., 0.8) makes the output more random, while a lower temperature (e.g., 0.2) makes it more deterministic. If this parameter is set differently in your API call than in the web interface you’re comparing against, this could lead to different results.
  • Temperature can be set anywhere between 0.0 and 2.0 for experimentation.
  • The API default is 1 (a combined request sketch follows the top_p section below).

Coding
{
"temperature": 0
}

Top_p Setting

  • top_p is an alternative to sampling with temperature, called nucleus sampling, where the model considers only the tokens comprising the top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • For the API, OpenAI generally recommends altering this or temperature, but not both.
  • Update top_p between 0 and 1.
  • The default is 1.

Coding for top_p of 1
{
"top_p": 1
}
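
To put both settings in context, here is a hedged sketch of a full request using the openai Python library (v1.x client); the messages are placeholders, and per the guidance above you would normally tune temperature or top_p, not both:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatBot, a GPT-3 model assistant by OpenAI."},
        {"role": "user", "content": "Explain our setup steps in plain language."},  # placeholder
    ],
    temperature=0,  # deterministic-leaning output, matching the JSON fragment above
    top_p=1,        # left at its default; adjust temperature OR top_p, not both
)
print(response.choices[0].message.content)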

Another minor thing we’re noticing is that the model choice matters when calling the API.
Consider the other models besides gpt-3.5-turbo-16k-0613 (a snapshot pinned to an older date):

  • gpt-3.5-turbo – Max Tokens 4,096
  • gpt-3.5-turbo-16k – Max Tokens 16,384

Because our API calls seem to come in around 3k tokens, we’ll try gpt-3.5-turbo and see if that helps resolve the mismatch between the web UI and the API.
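
As a rough way to check that assumption, a sketch using the tiktoken library; the prompt text and the 1,000-token completion headroom are placeholders:

import tiktoken

prompt_text = "...your full system + user prompt here..."  # placeholder

# Count how many tokens the prompt occupies under gpt-3.5-turbo's tokenizer.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt_tokens = len(encoding.encode(prompt_text))

# Leave headroom for the completion itself (roughly 1,000 tokens here).
if prompt_tokens + 1000 <= 4096:
    model = "gpt-3.5-turbo"      # 4,096-token context
else:
    model = "gpt-3.5-turbo-16k"  # 16,384-token context

print(prompt_tokens, "prompt tokens ->", model)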

gpt-3.5-turbo is simply an alias that points to gpt-3.5-turbo-0613.

The 16k model has a larger context length, but is billed at twice the price whether you use the extra length or not. The two should not diverge much in quality until you actually extend your inputs well into that extra length.


Got it, thanks for the clarification here on the alias re-pointing. It’s interesting to try to back into what the OpenAI team has built out, but I’m sure some more testing can get us to a comparable state.