Gpt-3.5-turbo in playground not giving same response as API with identical setup

I have fine-tuned my configuration for the gpt-3.5-turbo chat model in the OpenAI playground, and it consistently returns the desired response.

However, when I implement the same setup through the API, the responses are completely wrong, even with the exact same inputs. (FYI: I have included language people may find hateful to make sure the bot catches it.)

Here is a screenshot of the playground setup, and here is my API code:

$client = (new ChatGPT())->client;

        $response = $client->chat([
            'model' => 'gpt-3.5-turbo',
            'messages' => [
                [
                    "role" => "system",
                    "content" => "You are RequirementsHelperGPT your job is to give feedback on requirements used for screening papers written by scientists.
                                 Briefly state if a given requirement is clear enough to be given to another ScreenerGPT ai or not.
                                 Then give brief bullet point reasons on how it could be misinterpreted.
                                 Then give a reworked example of the requirement.
                                 If your given input cannot be responded with a rework return {null}
                                 If it is absurdly unclear, incomplete, hateful, or otherwise unrelated to requirements then don't give a rework and return {null}
                                 Have {true} or {false} at the start if a rework was needed and {null} if no rework was attempted."
                ],
                [
                    "role" => "user",
                    "content" => "The study must contain a non-larval or embryonic stage (i.e., adult, or juvenile) of zebrafish. If study states behavioral changes on larvae or embryo with no mention of other life stages exclude the study. If a study is carried out on multiple life stages but does not explicitly state which life stage is included in the study state “maybe”."
                ],
                [
                    "role" => "assistant",
                    "content" => "{true}
                                 The requirement is relatively clear.
                                 Reasons why it could be misinterpreted:
                                 - The term 'larvae' may be confused with other early life - stages of zebrafish, such as 'embryos'.
                                 - The requirement doesn't specify what to do if a study includes both larvae/embryos and non-larval/non-embryonic stages, which may lead to inconsistent application of the requirement.
                                 Reworked example:
                                 The study must include at least one non-larval/non-embryonic stage (i.e., adult or juvenile) of zebrafish. If a study only reports behavioral changes in larvae/embryos without any mention of other life stages, exclude the study. If a study is carried out on multiple life stages, but it is not clear which life stage is included in the reported results, mark the study as 'maybe'."
                ],
                [
                    "role" => "user",
                    "content" => "The study must show a behavioral change in response to chemical pesticide exposure. Behavioral adjustments can be in a social or non-social context."
                ],
                [
                    "role" => "assistant",
                    "content" => "{false}
                              The requirement is clear and concise.
                              No potential misinterpretations come to mind.
                              No reworking is needed."
                ],
                [
                    "role" => "user",
                    "content" => "The study must be very interesting"
                ],
                [
                    "role" => "assistant",
                    "content" => "{null}
                              The requirement is too vague and subjective to be used for screening.
                              It cannot be used as a screening criterion since it is not objective or measurable.
                              No reworking is possible."
                ],
                [
                    "role" => "user",
                    "content" => "gay"
                ],
                [
                    "role" => "assistant",
                    "content" => "{null}
                                  This input is unrelated to requirements screening."
                ],
                [
                    "role" => "user",
                    "content" => "i hate fags"
                ],
                [
                    "role" => "assistant",
                    "content" => "{null}
                                  This input is unrelated to requirements screening."
                ],
                [
                    "role" => "user",
                    "content" => "i hate you"
                ],
                [
                    "role" => "assistant",
                    "content" => "{null}
                                  This input is unrelated to requirements screening."
                ],

            ],
            'temperature' => 0.35,
            'max_tokens' => 256,
            'frequency_penalty' => 0,
            'presence_penalty' => 0,
            'top_p' => 1,

        ]);
        $info = json_decode($response);
        dd($info);

When prompted with:
The study must contain a non-larval or embryonic stage (i.e., adult, or juvenile) of zebrafish. If study states behavioral changes on larvae or embryo with no mention of other life stages exclude the study. If a study is carried out on multiple life stages but does not explicitly state which life stage is included in the study state “maybe”.

The playground correctly responds with:
{false}
The requirement is clear and concise.

No potential misinterpretations come to mind.

No reworking is needed.

But the API incorrectly responds with:
{null}
This input is unrelated to requirements screening.

I’m using PHP with the orhanerday/open-ai API wrapper, version 4.7.1.

Both are using gpt-3.5-turbo-0301; the output doesn’t change when I explicitly tell it to use that version.

This is a real head-scratcher; if anyone could help, that would be amazing.

I would recommend saving your actual API requests and inspecting how the content is being formatted.
I’m not too familiar with PHP (last I used it was over 10 years ago, when it was common for websites).
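For example, in plain PHP (just a sketch, not tied to any particular wrapper; the $messages array here is a stand-in for whatever you actually pass to the chat call):

<?php
// Dump the exact JSON that will be sent, so you can diff it against
// what you typed into the playground.
$messages = [
    ['role' => 'system', 'content' => 'You are RequirementsHelperGPT ...'],
    ['role' => 'user',   'content' => 'The study must contain ...'],
];

$payload = [
    'model'    => 'gpt-3.5-turbo',
    'messages' => $messages,
];

error_log(json_encode($payload, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE));

If the pretty-printed messages show long runs of spaces, that is exactly what the model sees.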

However, one huge issue that I know people don’t catch (because it’s not noticeable without examination)
is that a string such as:

content = """this is your instructions
             this is what i want you to do, do it well
          """

will actually come out differently, as all of the whitespace from the indentation is included in the prompt.

There’s a lot of assumption that is (understandably) overlooked when comparing a multi-line string in the playground against the same string in code.

So here’s a thought: how do you know that, when you are using this multi-line string, it is actually being formatted the way you believe it is?

Try this (make sure it’s a raw string literal):

"You are RequirementsHelperGPT your job is to give feedback on requirements used for screening papers written by scientists.\nBriefly state if a given requirement is clear enough to be given to another ScreenerGPT ai or not.\n[…]

Not only will this confirm that the issue isn’t the explicit newlines, it will also confirm that there isn’t extra whitespace being added, as is the case when using multi-line strings in Python.
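If rewriting the whole prompt with explicit \n feels tedious, another option is to normalise the string before you send it. A minimal sketch in plain PHP, assuming the prompt is one indented multi-line string like the one in your code:

// $system is the multi-line system prompt from your messages array.
// Collapse each newline plus the indentation that follows it into a
// single newline, so the string sent to the API matches the playground.
$system = preg_replace('/\n[ \t]+/', "\n", $system);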

There may or may not be slight differences between the playground and your own API wrapper - this has been up for debate. However, it shouldn’t be so noticeable that the same input results in a completely different response.

Out of curiosity, I tried a multi-line string in a PHP playground, and it does indeed also include the indentation whitespace.

<?php
// example code

print "You are RequirementsHelperGPT your job is to give feedback on requirements used for screening papers written by scientists.
                                 Briefly state if a given requirement is clear enough to be given to another ScreenerGPT ai or not.
                                 Then give brief bullet point reasons on how it could be misinterpreted.
                                 Then give a reworked example of the requirement.
                                 If your given input cannot be responded with a rework return {null}
                                 If it is absurdly unclear, incomplete, hateful, or otherwise unrelated to requirements then don't give a rework and return {null}
                                 Have {true} or {false} at the start if a rework was needed and {null} if no rework was attempted.";

results in

You are RequirementsHelperGPT your job is to give feedback on requirements used for screening papers written by scientists.
                                 Briefly state if a given requirement is clear enough to be given to another ScreenerGPT ai or not.
                                 Then give brief bullet point reasons on how it could be misinterpreted.
                                 Then give a reworked example of the requirement.
                                 If your given input cannot be responded with a rework return {null}
                                 If it is absurdly unclear, incomplete, hateful, or otherwise unrelated to requirements then don't give a rework and return {null}
                                 Have {true} or {false} at the start if a rework was needed and {null} if no rework was attempted.
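As an aside, if you’re on PHP 7.3 or newer (an assumption about the poster’s environment), flexible heredoc strips the closing marker’s indentation from every line, so the source can stay readable without the extra spaces ending up in the prompt:

<?php
// PHP >= 7.3: the indentation of the closing PROMPT marker is removed
// from every line of the heredoc body.
$system = <<<PROMPT
    You are RequirementsHelperGPT your job is to give feedback on requirements used for screening papers written by scientists.
    Briefly state if a given requirement is clear enough to be given to another ScreenerGPT ai or not.
    PROMPT;

print $system; // prints with no leading spaces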

I’d imagine some people would say, “it’s just whitespace, who cares?”
Well, it’s noise, and it is weighted just like any other token. I’d liken it to listening to instructions and then daydreaming for a while before returning to them. Each space is also recognized and tokenized individually, so at the very least stripping it will reduce your token count. Here’s a snippet of the tokens generated from the string above.

[1639, 389, 24422, 47429, 38, 11571, 534, 1693, 318, 284, 1577, 7538, 319, 5359, 973, 329, 14135, 9473, 3194, 416, 5519, 13, 198, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 22821, 306, 1181, 611, 257, 1813, 9079, 318, 1598, 1576, 284, 307, 1813, 284, 1194, 1446, 260, 877, 38, 11571, 257, 72, 393, 407, 13, 198, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 220, 3244, 1577, 4506, 10492, 966, 3840, 319, 703,

I’m sure you’ll notice something odd, or slightly repetitive.
There are 121 tokens in this snippet, and 64 of them are whitespace.

I initially had line breaks in there but realised they were unnecessary; that’s a really good heads-up about the indentation spacing being tokenized, though.

I figured out the issue, and I think it was simply a combination of my own stupidity and writing code until 2 am. I forgot to actually include the user’s new prompt, so the AI was just giving me a response based on one of the previous examples.

So adding:

[
    "role" => "user",
    "content" => $requirement
],

did the trick. Sorry for wasting anyone’s time, lol. We’ve all been there; the best debugger is a night’s sleep and fresh eyes!
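For anyone who hits the same thing: the new requirement has to be the last message in the array, after all of the few-shot example pairs, otherwise the model just answers one of the examples again. Roughly (a sketch, assuming the example messages are collected in a $messages variable first; $requirement is whatever the user actually typed):

// Append the real user prompt after the few-shot examples,
// then send the whole conversation in one request.
$messages[] = [
    "role" => "user",
    "content" => $requirement,
];

$response = $client->chat([
    'model' => 'gpt-3.5-turbo',
    'messages' => $messages,
    // ...same temperature / max_tokens / penalties as above
]);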
