Playground and API returning different results?

I found a few similar threads but no answers… I'm trying to make a bot that checks in on you about your New Year's resolutions, and I want to give it a little edge so it doesn't come off sounding like a motivational coach. I found some settings I like, namely text-davinci-002 with the temperature turned up (003 always seems a bit sanitized). But when I hit the API (through Zapier), the results I get sound, IMO, like it's using the wrong model.

Here’s a sample of what I get in the playground with 002, temp 0.9:

Text 6: So I just wanted to check in with you. How is your vibe? Good? Bad?


Text 7: D, D here. You should listen to a new song this week.


Yo D, it’s D. You dumbass, I can’t believe you didn’t try the restaurant near your apartment. We’re going there tomorrow, BTW.


Text 7: Yo D, it’s D. Was last night’s party fun?


Text 8: Hey buddy, I’m the guy who just came back from the future. Just wanted to remind you: meditate.

The API (through Zapier) returns stuff that sounds totally different. This is with the exact same prompt and settings:

I know it’s tough, but you can do it! Just one more week of sticking to your resolution to not stay out absurdly late! You can do it!

Working out is so important for your health! Make sure you fit in a workout this week, even if it’s just a walk around the block.

I know you’re tired, but you’re going to feel so good after you run today. Just put on your shoes and do it!

Way cheesier, cringier, more “rise and grind”. And it’s very similar to what I get using text-davinci-003:

Text 6: Hey Dennis! Can I ask you a question? How often are you getting out of the house this week? If not, what could you do to change that?


Text 7: Subways and buses be damned! Have you taken a long walk lately? It’ll be a chance to explore a new neighborhood and get some rest at the same time.


Text 8: Hey, it’s me again. How’s the exploring going? Have you gone to any new places this month?


Text 9: I heard there’s an awesome new spot to get coffee and take a break from the hustle and bustle of NYC. Let’s check it out this week!

Any idea what's going on here? I've been getting errors when I hit text-davinci-002 in the playground… Is it possible that the API falls back to a different model when it can't use the requested one? (The API response says the model is text-davinci-002.) Or is this a bug with Zapier?
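In case it helps anyone reproduce this outside of Zapier, here's a rough sketch of a direct call (assuming the Node openai library; the prompt is just a stand-in for my real one) that logs which model the API says actually answered:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Same settings as the playground: text-davinci-002, temperature 0.9.
// The prompt is a placeholder for the real check-in prompt.
const response = await openai.completions.create({
  model: 'text-davinci-002',
  prompt: "Write a short, edgy text message checking in on a friend's New Year's resolution.",
  temperature: 0.9,
  max_tokens: 100,
});

// If something upstream were silently swapping models, this should show it.
console.log('model:', response.model);
console.log('text:', response.choices[0].text);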

Appreciate the help!

1 Like

I'm having the same problem - I get different results from the API (for all 3.5 model versions) compared with the playground. Much better results from the playground, which is a bummer since I'm using the playground to pre-game what I do with the API.

I’m only getting consistent results with version 4.

Also getting different results between Azure and OpenAI for the same models. Kinda frustrating.

1 Like

I cannot replicate the problem.

Let's make the first turn of a chatbot in the playground:

Pound on the "view code" button in the playground after constructing your input and settings. You then only need to add one line to get the clean output:

[screenshot: the code generated by "view code", with one line added to print the completion]

The very low setting of top_p ensures not just similar quality, but near-identical output if that’s what you need to verify.

(We get the objectionable “Ah!” exclamation watermark of the AI not believing in leprechauns)
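In code it looks something like this (a sketch using the Node openai library rather than the exact snippet in the screenshot; the messages are placeholders for whatever you built in the playground):

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Mirror the playground settings exactly, with top_p pinned very low
// so repeated runs give near-identical output.
const response = await openai.chat.completions.create({
  model: 'gpt-3.5-turbo',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' }, // placeholder
    { role: 'user', content: 'Do you believe in leprechauns?' }, // placeholder
  ],
  temperature: 1,
  top_p: 0.01,
  max_tokens: 256,
});

// The one extra line: print the clean output.
console.log(response.choices[0].message.content);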


Azure models of similar or the same name aren’t necessarily in sync with the internal versions that OpenAI runs.

1 Like

There seems to be a common misunderstanding about the relationship between the Playground and the API. I’d like to take this opportunity to clarify that they are essentially the same.

In fact, the Playground is just a front-end user interface for the API. When you use the Playground, you’re directly interacting with the API. The only difference is that the Playground responses go through a moderation endpoint, but this doesn’t alter the output from the model.

To further illustrate this, you can verify it yourself by opening the Chrome Developer Tools while using the Playground. If you observe the network requests being made from the Playground, you’ll see they are indeed standard API endpoint calls, not calls to some special or different endpoint.

The discrepancy the original poster is experiencing between the Playground and the API through Zapier warrants further experimentation and investigation. Rest assured, it’s not due to the Playground and the API being fundamentally different or accessing different endpoints.

1 Like

I promise you that that’s the case!
Here is a side-by-side example - you'll see that the API call produces a nonsensical result compared with the good answer from the playground.

I know it's hard to read all of that JSON, but in the first case the Playground correctly picks "Author" as the next team member and gives the Author good instructions, whereas in the API example it picks "Critic" and gives instructions that don't make sense for the conversation thus far. (This all works fine in GPT-4.)

Playground:

[screenshot: JSON response in which "Author" is picked as the next team member]

Same thing through API:

[screenshot: JSON response in which "Critic" is picked as the next team member]

Settings for the API are:

export const OPENAI_API_MODEL = 'gpt-3.5-turbo-16k';
export const OPENAI_TEMPERATURE = 0;

And the call is:

  const { choices } = await openai.chat.completions.create({
    messages: formattedMessages,
    model: OPENAI_API_MODEL,
    temperature: OPENAI_TEMPERATURE
  });
"openai": "^4.13.0",

(Also: same result if I break up the system/user messages properly in the playground to more closely mimic the API call.)
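For context, formattedMessages is built roughly like this (FACILITATOR_PROMPT is just a stand-in name for the long team-leader instruction; the user message matches the task in the cURL call further down):

const formattedMessages = [
  // The long facilitator / team-leader system prompt
  { role: 'system', content: FACILITATOR_PROMPT },
  // The task from the human user
  { role: 'user', content: 'Human Admin User says: Write a short paragraph about the dire state of bees in America.' },
];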

For reference, here is the output from the API via gpt-4-1106-preview (which is correct and consistent with the Playground output):

2 Likes

Hm, interesting. If I send it via cURL, I get the correct response:

curl -X POST https://api.openai.com/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_OPENAI_API_KEY' \
-d '{
  "model": "gpt-3.5-turbo-16k",
  "temperature": 0,
  "messages": [
    {
      "role": "system",
      "content": "You are the leader of an AI working team, working on behalf of a human user. Your team consists of: facilitator (team leader - this is you) Author Critic As team leader you must look at the task definition and message history and think through the following decisions step by step: 1. Decide if the task is complete - a task can only be complete if every team member has had at least one turn - many tasks will require more than one turn per team member. for example where an engineer writes code, and a QA engineer reviews it, and the engineer makes changes based on the review 2. Decide if the task is complete, but needs human input - the AI team cannot complete the task without human input (also set taskComplete to true) 3. Give a reason for deciding the task is complete, complete but needing human input, or not complete - describe very briefly what each team member has done to contribute to the task 4. If the task is not complete, decide who should go next 5. If a team member has been selected, give them instructions for their turn Give your response in json using the following format: { \"taskComplete\": true | false, \"taskCompleteButNeedsHumanInput\": true | false, \"reasonForDecision\": \"reason for deciding task is complete, complete but needing human input or not complete\", \"nextMember\": \"name of the team member to go next\" | null, \"instructionsForNextMember\": \"instructions for the team member\" | null } Remember, do not complete until every team member has had at least one chance to take a turn."
    },
    {
      "role": "user",
      "content": "Human Admin User says: Write a short paragraph about the dire state of bees in America."
    }
  ]
}'

Output

{
  "id": "chatcmpl-8SuWbR1fXvN1iZoLmHlwGgOhfSnA7",
  "object": "chat.completion",
  "created": 1701901157,
  "model": "gpt-3.5-turbo-16k-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{ \"taskComplete\": false, \"taskCompleteButNeedsHumanInput\": false, \"reasonForDecision\": \"Task not complete\", \"nextMember\": \"Author\", \"instructionsForNextMember\": \"Please write a short paragraph about the dire state of bees in America.\" }"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 359,
    "completion_tokens": 56,
    "total_tokens": 415
  },
  "system_fingerprint": null
}

Hm, I suppose that means it's either in my app (which is doubtful, since it works with GPT-4) or in the "openai": "^4.13.0" Node library…

Just tested with "openai": "^4.20.1" and got the same result.

2 Likes

Interesting result. Since the cURL version is essentially doing exactly what the playground is doing, which I’d consider the true/raw version of an API call, I’d wager the issue lies somewhere in between, in the library itself or your app. I’m not familiar with the Node.js version, as I primarily use Python, but perhaps you could dig around in the library to see exactly how it’s building the API call, what exact parameters are used, and maybe find some debug information if possible.
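One way to dig in, assuming the v4 Node client (which I believe lets you pass a custom fetch), is to wrap fetch so the outgoing request body gets logged, and then diff that against the payload that works via cURL:

import OpenAI from 'openai';

// Log every outgoing request body before it is sent, so the JSON can be
// compared against the working cURL payload.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  fetch: async (url, init) => {
    console.log('Request to:', url);
    console.log('Body:', init?.body);
    return fetch(url, init);
  },
});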

Try it with the new 1.30 Python library. Maybe you’ve discovered a bug in the Node.js version.

Yep, it's a bug in the openai Node library.
I just tried an alternate Node library (GitHub - lgrammel/modelfusion: The TypeScript library for building multi-modal AI applications) and it works just the same as with cURL.

Thanks for being a sounding board! This has been frustrating me for a while.

P.S. Before doing so, I set the other params to the defaults in case that was the problem:

    const { choices } = await openai.chat.completions.create({
      messages: formattedMessages,
      model: OPENAI_API_MODEL,
      temperature: OPENAI_TEMPERATURE,
      top_p: 1,
      frequency_penalty: 0,
      presence_penalty: 0
    });

but that didn’t change the behavior.

4 Likes