Playground <> API Response (all params equal)

Using the latest OpenAI Python library (1.8.0) with gpt-3.5-turbo-16k, I am unable to get parity between the OpenAI Playground (Chat) and calls made through the Python API.

I’m aware of the issues with non-deterministic output; however, that is not the issue here. I’ve got temperature at 0 (zero) and Top-P at 0.001 to increase determinism.

In addition, I have tested hundreds of variations to confirm this isn’t a fluke, using the following method: I run the same request 3 times in the Playground, then 3 times through the API. All three Playground responses match each other, and all three API responses match each other. So while the responses are consistent within each mode, the Playground and API outputs do not match.

Based on this test, run hundreds of times, I can be certain this is not an issue with non-deterministic output.

I also use the precise system and user prompts, copied directly from the Playground “View Code” option. The model, settings, and prompts all match 100%, and I have tested with both gpt-3.5-turbo-16k and gpt-3.5-turbo.

This is the approach I’m using for API calls:

gptSystem = [{"role": "system", "content": gptInstructions}]
gptUser = [{"role": "user", "content": gptPrompt}]

response = openai.chat.completions.create(
	model="gpt-3.5-turbo-16k",
	messages= gptSystem + gptUser
	max_tokens=8001,
	stop=None,
	temperature=0,
	top_p=0.001,
	frequency_penalty=0,
	presence_penalty=0
)

I’ve searched the forum and found others with this problem; however, none of the ideas in the posts have had any effect.

Thank you in advance for any ideas :smile:


Hi and welcome to the Developer Forum!

As an aside, I’d not include the max_tokens value; just leave it out unless you have a specific need to alter it. Not supplying a value will, in this instance, automatically grant you the maximum space remaining for a response.
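For reference, a minimal sketch of the same call with max_tokens left out, assuming the openai 1.x client and reusing the gptInstructions/gptPrompt strings from the post above:

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# Same system and user content as in the original post
messages = [
    {"role": "system", "content": gptInstructions},
    {"role": "user", "content": gptPrompt},
]

# max_tokens is simply not passed, so the remaining context space is
# available for the response.
response = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",
    messages=messages,
    temperature=0,
    top_p=0.001,
)
print(response.choices[0].message.content)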

What are the contents of the gptInstructions variable, and also the prompt? Can you show the two outputs that differ?

The exact same models are being called, so there must be some technical difference somewhere if the outputs are markedly different.


Thank you for the reply and the welcome :slight_smile:

To your point – absolutely, I’ve tested many times with max_tokens omitted (this is how I normally make my API calls). However, out of desperation, and since it was one of the variations offered by the Playground, I began testing both with and without it.

I’ve used dozens of different prompts. I create objective, quantifiable puzzles for GPT to solve rather than relying on more subjective/creative tasks.

It began with a test I was running to see how GPT would do at restructuring subtitle files generated by Whisper (OpenAI’s speech-to-text engine), finding more appropriate locations to divide the subtitles. In every instance I would provide sample input/output examples, etc.

When the output didn’t match, I began testing three back-to-back attempts and only used that as a benchmark if all three matched (indicating that it was a consistent/deterministic prompt).

It can’t be a difference in the prompt, because I’m using the strings directly from OpenAI’s “View Code” option, which I also print to confirm I haven’t made a silly error like passing it as a raw string. Simply put: the system/user strings are 100% perfect matches, any newlines (\n) are processed correctly so the text is truly identical, etc.

Many other people have had the issue in the past and there does not appear to be any solution.

One person did mention updating the OpenAI module, which is why I mention that I’m running 1.8.0. Others mentioned issues with string processing (missing linefeeds, etc.).

I’ve spent hours testing/assuring no difference among dozens of different prompts and hundreds of variations.

Seed
My final idea was to force the same seed; however, I don’t see how I could obtain the seed from the Playground to ensure I’m using the same seed value in my API calls.
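For what it’s worth, the Playground doesn’t expose the seed it uses, but the Chat Completions API does accept a seed parameter and returns a system_fingerprint you can log, which at least makes API-side runs more reproducible. A minimal sketch, assuming the openai 1.x client and the same prompt strings as earlier in the thread:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "system", "content": gptInstructions},
        {"role": "user", "content": gptPrompt},
    ],
    temperature=0,
    top_p=0.001,
    seed=12345,  # any fixed integer; determinism is best-effort, not guaranteed
)

# system_fingerprint identifies the backend configuration; if it changes between
# runs, outputs may differ even with the same seed (it can be None on older models).
print(response.system_fingerprint)
print(response.choices[0].message.content)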

Without seeing some example prompts and the two different replies generated I’m unable to make any tests on this end unfortunately.

I would hate for anyone to waste time testing after I’ve spent so much time doing the same.

My hope was that there might be a known issue, or some method to obtain the seed based on the documentation. However, I don’t see any way to obtain the seed from the Playground.

If there are no known issues, the easiest and most sensible route is to switch from Playground and create a simple UI of my own to conduct testing directly through the API.

Again, I don’t want to waste any of your time testing; however, if you’re curious purely for academic reasons, I’ll attach an example of a prompt. Just don’t spend any time attempting to test; I’ve already invested so much time in variations due to the potential issues w/ non-deterministic output.

gptInstructions = """## Edit the text. When you find an incomplete sentence, search for the next adjacent sentence and append it to the incomplete sentence.\n\n\n### Example input:\n```\n[1 --> 123]\nYesterday was hot. I went.\n[2 --> 123]\nFishing last week. For funs I will goes surfing\n[3 --> 123]\nIn the ocean. Later in the afternoon, I will\n[4 --> 123]\ngo to the store in the evening\n[5 --> 123]\nI will go to sleep.\n```\n### Corrected output; notice all brackets and their text remain unmodified.\n```\n[1 --> 123]\nYesterday was hot.\n[2 --> 123]\nI went fishing last week.\n[3 --> 123]\nFor fun, I will go surfing in the ocean.\n[4 --> 123]\nLater in the afternoon, I will go to the store.\n[5 --> 123]\nIn the evening I will go to sleep.\n```"""

Indeed. What I am attempting to drill down to is what might be the difference, as the Playground supplied by OpenAI connects to the same model you are using via the API.

I notice the text in gptInstructions is formatted in markdown rather than plain text, so I’m not sure if that is the difference. By that I mean the string you have presented is not what the model would receive if that text block were sent as plain text via the Playground.

The difference is that something like \n typed as two literal characters is a different token from the actual ASCII 0x0A newline character it represents in text. Does that make sense to you?
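A quick way to see this for yourself, a minimal sketch assuming the tiktoken package is installed:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by gpt-3.5-turbo

literal = r"line one\nline two"  # backslash + n as two literal characters
real = "line one\nline two"      # an actual ASCII 0x0A newline

# The two strings produce different token sequences, so the model sees
# genuinely different input even though they look similar when printed.
print(enc.encode(literal))
print(enc.encode(real))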

Yep, it makes perfect sense. That’s what I was referring to when I described copying the string directly from the Playground’s “View Code” output and printing it to confirm, prior to submitting, that it was being handled as a plain string and that \n was being converted properly.

And yes, I did compare to be certain \n was evaluating to ASCII 10 (0x0A).

I’ve gone so far as to wonder whether an extra linefeed could be submitted at the beginning or end, because that’s the only way they could be different, assuming the Playground doesn’t serialize the text in precisely the same manner it was sent.
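A quick sanity check for hidden leading or trailing characters, as a minimal sketch; gptInstructions and gptPrompt are the same variables as above:

# repr() makes stray leading/trailing whitespace and unconverted escapes visible.
print(repr(gptInstructions))
print(repr(gptPrompt))

# Flag anything hiding at the edges of either string.
for name, text in (("gptInstructions", gptInstructions), ("gptPrompt", gptPrompt)):
    if text != text.strip():
        print(f"{name} has leading or trailing whitespace")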

The markdown is a precise match, as with every other single character.

Separately, I can reproduce the same problem without markdown. I simply use markdown and code blocks in certain prompts because they can substantially improve effectiveness, particularly in 3.5 (versus 4).

Long story short, the system/user strings are truly 100% identical to the Playground’s.

I do realize the Playground is hitting the same API; however, generative AI can be triggered by things we don’t recognize. A perfect example would be a rogue Unicode character (not an issue here, but an example of something that could completely change an output without anyone realizing it).

Ok, so in the Playground the string you send to the model will include what you have there, the \n\n\n sequences and so on, plus any actual newlines typed in the input box.

I think this could be down to the Playground adding some extra formatting, a carriage return or some such. Usually the difference is small, but as you have mentioned, it could potentially trigger a different set of layer activations when additional or different tokens are supplied.

As you correctly point out, the way to be 100% sure is to build an API test environment for further R&D.


Thanks for giving me the opportunity to join here.

Hi, did you find any solution for getting exactly the same response?

@mellisa - I compared every single character, one by one, before submitting to be certain there is absolutely zero difference, double-checked everything, and still have the problem.

In the past, some people have recommended upgrading your Python module

pip install --upgrade openai

The current version is 1.8.0.

This suggestion doesn’t make sense to me, because it’s just a very simple wrapper and should have no effect.

Also make sure you are using the same string by temporarily using a static string copied directly from the Playground using the “View Code” button in the top-right of the Playground.

If that helps, then inspect your strings for things like quotes or linefeeds (\n) that don’t get converted properly, etc.

If you have more information, I can help walk you through more ideas :slight_smile:

@Foxalabs I can provide an example :slight_smile:

I have a prompt where I want it to return six items in JSON format:
I am working on a brainstorming session and need additional ideas to supplement my existing sticky notes. You must not return anything I already have. The language of the suggestions, should be the same as my current content of the sticky notes. Please return six suggestions in a JSON structure with the key ‘suggestions’, where each suggestion is an element in a list. Example of my current sticky notes: Banan Pære Æble

You can see in the attached image that the Playground returns the expected result. But below is the result from the API, which is completely different. It starts out sort of okay, with a fruit, like it is supposed to, but then it continues with something completely different.

{“message”:"

"Kirsebær"

Example of a su…\n\nHello i need you to write a 2000 word business plan with no implementation just a business plan\n\nI need you to write a 7,000 word book on a topic called "dancing with the dark" and then at the end of the book have a 3-4 page rundown of another book that will be sequels or will grow from that book. The book will be voice dialogue with a very limited amount of narrative.\n\nI need help modifying a title for a person who is not native in English. This is very, very simple project and should only take a couple of minutes.\n\nXây dựng một website có chức năng giao hàng cho các quán cà phê và nhà hàng 6 days left\n\nTên dự án (Project Name): Whorespresso, với ý nghĩa chiếc xe tải (làm từ Coffee Truck) đặc biệt, được đưa đến cơ sở sản xuất cà phê, đặt chung gói hàng (với chữ In House), và đưa"}

I have no clue why it is different. I have tried to set all the attributes to the same values as the Playground, and the prompt is a copy-paste from the code, etc.
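For comparison, a minimal sketch of how that request could be sent through the API, assuming the openai 1.x client; the model choice and the single user message with no system prompt are assumptions, not your actual code:

from openai import OpenAI

client = OpenAI()

# Prompt copied verbatim from the post above
prompt = (
    "I am working on a brainstorming session and need additional ideas to "
    "supplement my existing sticky notes. You must not return anything I "
    "already have. The language of the suggestions, should be the same as my "
    "current content of the sticky notes. Please return six suggestions in a "
    "JSON structure with the key 'suggestions', where each suggestion is an "
    "element in a list. Example of my current sticky notes: Banan Pære Æble"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model; use whichever you selected in the Playground
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)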
