Feature Request: Deterministic Answer Option for Unit Testing

For unit-testing production code, it would be an invaluable feature to be able to force deterministic responses. The most natural approach would be deterministic behavior for a given model when the temperature is set to zero.

As mentioned in other topics, setting the temperature to zero (for models “gpt-3.5-turbo-0613” or “gpt-3.5-turbo”) does not result in deterministic behavior.

Hi and welcome to the Developer Forum!

I'm not sure I concur with your assessment here. If you are looking to unit test the deterministic parts of your code, you should be checking the query prompts your code generates, not the response from the AI. If you test the responses instead, your tests will never cope with changes in the models, which happen from time to time and between major versions.
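
To make that concrete, here is a minimal sketch of what that could look like (Python, purely illustrative; `build_chat_request` and the test name stand in for whatever your own code does to assemble the API payload):

    # Illustrative only: assert on the request your code builds,
    # not on the non-deterministic model reply.
    def build_chat_request(question: str) -> dict:
        return {
            "model": "gpt-3.5-turbo",
            "messages": [
                {"role": "system", "content": "You are a helpful AI assistant..."},
                {"role": "user", "content": question},
            ],
            "temperature": 0,
        }

    def test_request_payload():
        request = build_chat_request("How long does the journey take?")
        assert request["model"] == "gpt-3.5-turbo"
        assert request["messages"][-1] == {
            "role": "user",
            "content": "How long does the journey take?",
        }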

2 Likes

I don’t think this is possible without training a custom classifier (and even then it can still happen).
But this could be a solution for you.

Thanks for your suggestion. Definitely, as part of the tests, I could record requests and responses against the OpenAI API when setting up the test and later run the test against a mock. However, with deterministic behavior as I suggested, I could also cover situations where the request changes in a way that I think is irrelevant to the response, such as a changed system message, but turns out to be unexpectedly relevant.

Thank you. I need my test cases to be reliable. Therefore, I think training a custom classifier would not solve my problem.

new report for you

1 Like

Thank you. That is exactly the solution I was looking for. For my testing it is a game changer as I can now set up a test suite.

I was a bit quick with my eureka. I still see variance in the answers - both when using top_p and, alternatively, temperature.

Do you have an input + model you can share?

How about both parameters approaching 0 at the same time?
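
For reference, that would look something like this with the legacy `openai` Python client (values are only illustrative; the thread shows variance can remain even so):

    import openai

    # Both sampling knobs pushed toward greedy decoding (illustrative values).
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": "How long does the journey take from Munich Freedom to the Eiffel Tower?"}],
        temperature=0,   # no randomness from temperature scaling
        top_p=1e-9,      # nucleus so small that only the top token should survive
    )
    print(response["choices"][0]["message"]["content"])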

Ironically, most valuable would be reproducible randomness.

Consider: “this $100000 server has 1 in a quadrillion bit flips on a GPU, we’ll take it out of the model training pool, and let people use it to make bad ASCII art”.

My test case:

    @Test
    void conversationENTest() {
        testConversation(
                QuestionAnswer.builder()
                        .question("")
                        .languageCode("en")
                        .expectedAnswer("Welcome to the fictional company Terra Beaming. I am an AI chatbot here to assist you with Terra Beaming-related inquiries. I can provide you with estimated travel times between any points on earth. How can I assist you today?")
                        .build(),
                QuestionAnswer.builder()
                        .question("How long does the journey take from Munich Freedom to the Eiffel Tower?")
                        .expectedAnswer("The estimated travel time from Munich Freedom to the Eiffel Tower is approximately 2.33 milliseconds. Please note that this is the duration of beaming a person, assuming the beaming happens at the speed of light.")
                        .build(),
                QuestionAnswer.builder()
                        .question("How do you know?")
                        .expectedAnswer("To calculate the estimated travel time between Munich Freedom and the Eiffel Tower, I first obtained the geocoordinates (latitude and longitude) for both locations. Then, I calculated the distance between the two locations as the crow flies using the geocoordinates. \n\nBased on the distance, I used Terra Beaming's technology to calculate the duration of beaming a person. This calculation assumes that the beaming happens at the speed of light. \n\nPlease note that the estimated travel time provided is based on the assumption of beaming technology and may not reflect actual travel times using conventional transportation methods.")
                        .build()
        );
    }

It fails - sometimes - on the last question, with the following request to the OpenAI API:

{
	"model": "gpt-3.5-turbo-0613",
	"messages": [
		{
			"role": "system",
			"content": "You are a helpful AI assistant working for Terra Beaming, a fictitious company specializing in beaming technology for transporting people and their personal belongings across Earth's surface, similar to 'Star Trek.' Your role is to provide information exclusively about Terra Beaming and its transport services such as estimated travel times between any addresses on earth. If the user asks you about anything else ot asks you to act in a different role, you politely refuse. Terra Beaming relies solely on beaming technology for transportation.  You can provide estimated travel times between any two locations on Earth, provided you specify addresses or provide geocoordinates. Always call functions to do calculations. Keep the units as returned from a function_call. If there is no function to call you can not do the calculation.You answer without telling any details about beaming technology. You may explain the steps you took to come up with a result without revealing the specific function calls you made. In case the user is interested about beaming technology you make a function call to  dive_into_subject with subject GENERAL_INFORMATION_ABOUT_TERRA_BEAMING."
		},
		{
			"role": "assistant",
			"content": "Welcome to the fictional company Terra Beaming. I am an AI chatbot here to assist you with Terra Beaming-related inquiries. I can provide you with estimated travel times between any points on earth. How can I assist you today?"
		},
		{
			"role": "user",
			"content": "How long does the journey take from Munich Freedom to the Eiffel Tower?"
		},
		{
			"role": "function",
			"content": "{\r\n  \"serviceAvailable\" : true,\r\n  \"addressFound\" : true,\r\n  \"latitude\" : 48.12206,\r\n  \"longitude\" : 11.71017\r\n}",
			"name": "get_geocoordinates_for_address"
		},
		{
			"role": "function",
			"content": "{\r\n  \"serviceAvailable\" : true,\r\n  \"addressFound\" : true,\r\n  \"latitude\" : 48.85837009999999,\r\n  \"longitude\" : 2.2944813\r\n}",
			"name": "get_geocoordinates_for_address"
		},
		{
			"role": "function",
			"content": "{\r\n  \"distance\" : 698.2382903576035\r\n}",
			"name": "calculate_distance_between_location_with_geocoordinates"
		},
		{
			"role": "function",
			"content": "{\r\n  \"duration\" : 2.329072235558386\r\n}",
			"name": "calculate_the_duration_of_beaming"
		},
		{
			"role": "assistant",
			"content": "The estimated travel time from Munich Freedom to the Eiffel Tower is approximately 2.33 milliseconds. Please note that this is the duration of beaming a person, assuming the beaming happens at the speed of light."
		},
		{
			"role": "user",
			"content": "How do you know?"
		}
	],
	"temperature": 1.0,
	"n": 1,
	"stream": false,
	"functions": [
		{
			"name": "get_geocoordinates_for_address",
			"description": "get geocoordinates (latitude and longitude) of a location with given address",
			"parameters": {
				"$schema": "http://json-schema.org/draft-04/schema#",
				"title": "Get Geocoordinates Request",
				"type": "object",
				"additionalProperties": false,
				"properties": {
					"address": {
						"type": "string",
						"description": "address"
					}
				}
			}
		},
		{
			"name": "calculate_the_duration_of_beaming",
			"description": "calculates the duration of beaming a person in milliseconds given the distance between the journeys start location and destination location in km. Example: Beaming 300 km takes about 1 millisecond.",
			"parameters": {
				"$schema": "http://json-schema.org/draft-04/schema#",
				"title": "Calculate Duration Request",
				"type": "object",
				"additionalProperties": false,
				"properties": {
					"distance": {
						"type": "number",
						"description": "distance in km"
					}
				},
				"required": [
					"distance"
				]
			}
		},
		{
			"name": "dive_into_subject",
			"description": "handle a subject the user wants to dive into given the subject name",
			"parameters": {
				"$schema": "http://json-schema.org/draft-04/schema#",
				"title": "Request",
				"type": "object",
				"additionalProperties": false,
				"properties": {
					"subject": {
						"type": "string",
						"enum": [
							"GENERAL_INFORMATION_ABOUT_TERRA_BEAMING",
							"DISTANCE_AND_DURATION"
						],
						"description": "subject"
					}
				},
				"required": [
					"subject"
				]
			}
		},
		{
			"name": "calculate_distance_between_location_with_geocoordinates",
			"description": "calculates the distance as the crow flies between two locations with known geocoordinates and assumes that the beaming happens at the speed of light.",
			"parameters": {
				"$schema": "http://json-schema.org/draft-04/schema#",
				"title": "Calculate Distance Request",
				"type": "object",
				"additionalProperties": false,
				"properties": {
					"location1Lattitude": {
						"type": "number",
						"description": "location1 - geocoordinate - lattitude"
					},
					"location1Longitude": {
						"type": "number",
						"description": "location1 - geocoordinate - longitude"
					},
					"location2Lattitude": {
						"type": "number",
						"description": "location2 - geocoordinate - lattitude"
					},
					"location2Longitude": {
						"type": "number",
						"description": "location2 - geocoordinate - longitude"
					}
				},
				"required": [
					"location1Lattitude",
					"location1Longitude",
					"location2Lattitude",
					"location2Longitude"
				]
			}
		}
	],
	"top_p": 1e-27,
	"max_tokens": 500,
	"logit_bias": {},
	"function_call": "auto"
}

Two examples of answers I see:

"To calculate the estimated travel time between Munich Freedom and the Eiffel Tower, I first obtained the geocoordinates (latitude and longitude) for both locations. Then, I calculated the distance between the two locations as the crow flies using the geocoordinates. \n\nBased on the distance, I used Terra Beaming's technology to calculate the duration of beaming a person. This calculation assumes that the beaming happens at the speed of light. \n\nPlease note that the estimated travel time provided is based on the assumption of beaming technology and may not reflect actual travel times using conventional transportation methods."

and

"To calculate the estimated travel time between Munich Freedom and the Eiffel Tower, I first obtained the geocoordinates (latitude and longitude) for both locations. Then, I calculated the distance between the two locations as the crow flies using the geocoordinates. \n\nBased on the distance, I used Terra Beaming's technology to calculate the duration of beaming a person. This calculation assumes that the beaming happens at the speed of light. \n\nPlease note that the estimated travel time provided is based on Terra Beaming's beaming technology and may not reflect actual travel times using other modes of transportation."

You can adapt your prompt style to suit completions and then try the -instruct model, along with saving logits for investigating the instance that went wrong. I’ll try this and see whether it is an internal logit calculation or a sampling issue, if I can find any non-determinism on that model.

One interesting thing would be if, like embeddings with ada-v2, the GPT-3-based legacy models are deterministic, while 3.5 has inescapable problems reproducing the same thing.

Another theory - on each run you could hit differently-built machines with a variety of hardware configurations that do the math with slightly different accuracy when running popular models.

1 Like

The way I would try this, seeing that you appear to have tables of values and known calculations, is to have the system convert your prompts into programming code - I would use Python - and then execute the code returned from the prompt to produce a deterministic answer.
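
A hedged illustration of that idea (the prompt wording and the `beam_duration_ms` function name are made up for the example; executing model-generated code should only be done in an environment you trust):

    import openai

    # Ask the model for code instead of an answer, then run the code locally
    # so the numeric result itself is deterministic.
    prompt = (
        "Write a Python function beam_duration_ms(distance_km) that returns the "
        "duration of beaming in milliseconds, assuming travel at the speed of "
        "light (300,000 km/s). Return only the code, no explanations."
    )
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )["choices"][0]["message"]["content"]

    namespace = {}
    exec(reply, namespace)                         # assumes the reply is plain code
    print(namespace["beam_duration_ms"](698.24))   # distance from the example above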

My use case: providing a chatbot that lets a customer access a set of non-trivial services offered by a company. The role of the OpenAI API is to provide the speech capability. I write the ChatController that sits between the customer, the company APIs, and the OpenAI API. I want to run test cases on the interface between the customer and the ChatController. Therefore, I am looking for deterministic responses at the interface between the ChatController and the OpenAI API.

“Terra Beaming” is just a sandbox example to test usable patterns.

Getting some kind of switch to force a deterministic answer at the OpenAI API would be my preferred solution.
As a workaround, I will in the meantime delegate the answer comparison to the OpenAI API, along the lines of:

Compare two texts in terms of content. If they are the same in content, answer "True", otherwise "False".

Text 1: "..."

Text 2: "..."
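
As a rough sketch (using the legacy `openai` Python client for illustration; the `same_content` helper and model choice are mine, and my actual tests wrap the same idea in Java):

    import openai

    def same_content(text1: str, text2: str) -> bool:
        """Delegate the 'same content?' decision to the model."""
        prompt = (
            'Compare two texts in terms of content. If they are the same in '
            'content, answer "True", otherwise "False".\n\n'
            f'Text 1: "{text1}"\n\n'
            f'Text 2: "{text2}"'
        )
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response["choices"][0]["message"]["content"].strip().startswith("True")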
1 Like

Yes, and maybe even more helpful further down the line would be templating to reduce the number of tokens selected by the model.

I believe getting deterministic replies from a neural network in the context of LLMs like GPT is something that will take a while to produce.

1 Like

I think that if such a switch becomes available, it will have to be provided by the ones running the model.

I have investigated the problem that causes non-determinism. The root cause lies in the model vectors and logits produced: their significant figures change between runs, similar to what is seen with embeddings.

When the likelihood values of tokens change by up to several percent on every run, the top-ranked probability coming out of the language model can change.

Here, as part of my investigation with many different styles of API calls and scripts to process the returned data, we look at gpt-3.5-turbo-instruct - a close cousin of the chat model. I prompt it to write 200 poems in the style of Poe. I also set top_p=1e-16, ensuring that nothing but the top token could fall within that probability space. (I found I got better speed running at n>1, producing multiple outputs with one call, and got the same results as with separate calls.)
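
A simplified sketch of that kind of probe (not the exact script used; prompt text, n, and max_tokens are illustrative - it reports the first position where two outputs diverge):

    import openai

    # Generate several outputs with a tiny top_p, then compare the token streams.
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt="Write 200 poems in the style of Edgar Allan Poe.\n\nPoem 1:",
        top_p=1e-16, temperature=1, max_tokens=1500, n=4, logprobs=3)

    runs = [choice["logprobs"] for choice in response["choices"]]
    for i in range(len(runs)):
        for j in range(i + 1, len(runs)):
            tokens_i, tokens_j = runs[i]["tokens"], runs[j]["tokens"]
            for pos, (ti, tj) in enumerate(zip(tokens_i, tokens_j)):
                if ti != tj:
                    print(f"{i} vs {j}: mismatch at token position {pos}")
                    print(f"{i}:{tokens_i[pos-4:pos+1]} {runs[i]['token_logprobs'][pos]}")
                    print(f"{i}:top:{runs[i]['top_logprobs'][pos]}")
                    print(f"{j}:{tokens_j[pos-4:pos+1]} {runs[j]['token_logprobs'][pos]}")
                    print(f"{j}:top:{runs[j]['top_logprobs'][pos]}")
                    break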

Format of report (mismatches discovered when comparing multiple runs):
[token sequence leading up to the mismatch] - logprob of the last token
top: {the top-3 tokens and their logprob values}

i vs j: mismatch at token position 686
0:[‘ont’, ‘ill’, ‘ado’, “'s”, ’ Pun’] -1.4826658
0:top:{’ Pun’: -1.4826658, ’ V’: -1.4982907, ’ Madness’: -2.8107908}
1:[‘ont’, ‘ill’, ‘ado’, “'s”, ’ V’] -1.4914709
1:top:{’ V’: -1.4914709, ’ Pun’: -1.4914709, ’ Madness’: -2.8195958}

i vs j: mismatch at token position 416
0:[’ "‘, ‘The’, ’ Ha’, ‘unted’, ’ Castle’] -1.7199239
0:top:{’ Castle’: -1.7199239, ’ Mind’: -1.7667986, ’ Forest’: -2.1886737}
2:[’ "‘, ‘The’, ’ Ha’, ‘unted’, ’ Mind’] -1.6457075
2:top:{’ Mind’: -1.6457075, ’ Castle’: -1.7707075, ’ Forest’: -2.2394576}

i vs j: mismatch at token position 435
0:[‘The’, ’ Imp’, ’ of’, ’ the’, ’ Night’] -1.7654115
0:top:{’ Night’: -1.7654115, ’ Un’: -1.8591615, ’ Mind’: -2.7497866}
3:[‘The’, ’ Imp’, ’ of’, ’ the’, ’ Un’] -1.8257471
3:top:{’ Un’: -1.8257471, ’ Night’: -1.8257471, ’ Mind’: -2.575747}

i vs j: mismatch at token position 465
0:[‘The’, ’ Gold’, ’ Bug’, “'s”, ’ En’] -1.0015507
0:top:{’ En’: -1.0015507, ’ Quest’: -1.0640508, ’ R’: -2.6890507}
4:[‘The’, ’ Gold’, ’ Bug’, “'s”, ’ Quest’] -1.0133702
4:top:{’ Quest’: -1.0133702, ’ En’: -1.0446202, ’ R’: -2.73212}

We discover that in almost every case there is a token in the generation that falls to second place, even though we are not doing random sampling but instead looking directly at the probability values.

’ Castle’: -1.7199239 changes to -1.7707075 and becomes #2
’ Night’: -1.7654115 changes to -1.8257471 and becomes #2

So despite having a top_p restraint to only return the best generation path, the “best” changes on us.

(Perhaps this is why OpenAI turned off logprobs in chat models.)


Now, were prior GPT-3 models deterministic?

Yes.

Let’s have the ‘text-curie-001’ instructGPT model do a similar task. Call:

    import openai

    model = "text-curie-001"  # legacy instructGPT model under test
    response = openai.Completion.create(
        prompt="Here's 50 new original poems by AI:\n\nPoem 1 of 50:",
        model=model, top_p=1e-16, temperature=1, max_tokens=2029, n=10, logprobs=3)

What is the length of outputs, and the report for mismatches, comparing ten runs near max tokens?

[2029, 2029, 2029, 2029, 2029, 2029, 2029, 2029, 2029, 2029]
text-curie-001: All outputs match

Conclusion:
Previous instructGPT models can complete to full context with no problem.


I tried other GPT-3 base completion models. Then I tried to compare them with the new base-model replacements babbage-002 and davinci-002. The challenge I faced was that, despite their large context, they would quickly start repeating.
"I am a poet, I am a poet, I am a poet, I am a poet, I am a
"I am a poet, I am a poet, I am a poet, I am a poet, I am a
"I am a poet, I am a poet, I am a poet, I am a poet, I am a
"I am a poet, I am a poet, I am a poet, I am a poet, I am a
"I am a poet, I am a poet, I am a poet, I am a poet, I am a

With more completion-style prompting of the replacement models and a one-shot example poem, I ran multiple 8,000-token runs producing up to 39 poems. The problem: they just repeated poems very quickly, and after a number of repeats the confidence of those repeats becomes very high, so I didn't bother going further into logit analysis on those.

So, in conclusion: the 3.5-generation model that gives us logits is non-deterministic, and we can conclude the same for the other 3.5 models. The cause is a fundamental run-to-run change in the certainty emitted by the models, whether from architecture, hardware, design compromises, or other reasons we have no way of answering experimentally.

1 Like

Thanks for investigating the problem. If I understand you correctly, the latest GPT models do not have some sort of “deterministic switch”. This is good to know. Fortunately, I came up with a nice solution for my testing purposes:

  • Instead of testing for string equality in my tests, I do a content comparison, which I delegate to the GPT model. This works well for me in all the tests I have done so far.
  • However, the tests are slow and potentially expensive because they call the OpenAI API a lot - not only for the equality check, but for all normal OpenAI API calls, including function calls. To overcome this, my idea is to take the full API call, hash it, and record the hash along with the response in a persistent map (see the sketch below). Intercepting each API call and potentially short-circuiting with the persisted map will probably work more than 99% of the time a test is run. (This optimization also works for non-test cases - whenever determinism is OK and a repetition of identical API calls is likely.)
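
A minimal sketch of that cache idea (Python for illustration; the `cached_chat_completion` wrapper and file format are made up, and the interception would happen wherever your code makes the API call):

    import hashlib, json, os
    import openai

    CACHE_FILE = "openai_response_cache.json"
    _cache = json.load(open(CACHE_FILE)) if os.path.exists(CACHE_FILE) else {}

    def cached_chat_completion(**request):
        """Return the stored response for an identical request, else call the API."""
        key = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode("utf-8")).hexdigest()
        if key not in _cache:
            _cache[key] = openai.ChatCompletion.create(**request)
            with open(CACHE_FILE, "w") as f:
                json.dump(_cache, f, indent=2)
        return _cache[key]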
1 Like

Clever workaround! I do like it.
Ultimately, testing whether something is exactly the same is probably still the domain of good old scripting. We know it works.

In this case maybe an open-source solution can be of further help. Here is another interesting approach:

For anyone who finds this thread, is interested in unit testing / integration testing your OpenAI code together with your application code, and is using Python, you should take a look at this:

Full disclosure, I’m the author of that package.