Coping with inconsistent results on identical inputs

I understand we can’t guarantee identical results for the same prompt across multiple runs. That can be quite an issue if you don’t have a downstream solution to align them.
In the analog world, that’s not really a problem, since we can easily compute a distance between responses.
In the discrete world, where you ask GPT to return a string (like a category, in my case), it is not obvious how to compare responses. Has anyone found a ready-to-use solution, based on a dictionary of synonyms for example?

Hi and welcome to the Developer Forum!

You need to narrow down the solution space. One of the simplest ways of doing this is to request that the output be in a known format, like JSON. If you then also provide the model with a template that lists the available options and instructions to find the closest match, it will do so.
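
For example, a minimal sketch of such a template built into the system prompt (the category names and wording here are made up for illustration):

<?php
// Hypothetical list of allowed categories; the model is instructed to pick
// the closest match from this list instead of inventing new labels.
$allowedCategories = ["News", "Science", "Sports", "Finance", "Culture"];

$systemPrompt =
  "You are a document classifier.\n" .
  "Respond ONLY with JSON of the form {\"category\": \"<value>\"}.\n" .
  "Allowed values: " . implode(", ", $allowedCategories) . ".\n" .
  "If no value fits exactly, pick the closest match from the list.";
?>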

4 Likes

do you mean like embeddings? :thinking:

Adding to Foxabilo’s response, this sounds like a good old-fashioned fine-tuning problem to me. You can create a system or user prompt listing the available categories, then fine-tune a gpt-3.5-turbo model on input/output pairs, where the output is the category in your desired format. In practice, this approach works extremely well, meaning you won’t have to worry about the model deviating from the defined list of categories. Depending on the nature of the categories, you may still see cases of misclassification; even then, however, the model will not deviate from the list of categories provided.

I have a few of these fine-tuned models, anywhere in the range of 7 to 50 categories, and it works decently.
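
For illustration, one training example in the chat fine-tuning format could look roughly like this (the categories and document text are placeholders); each example becomes one line of the .jsonl training file:

<?php
// Build one chat-format training example: system prompt with the category
// list, a user message with the document, and the expected category label.
$example = [
  "messages" => [
    ["role" => "system", "content" => "Classify the document into exactly one of: News, Science, Sports, Finance."],
    ["role" => "user", "content" => "Title: Webb telescope spots a distant galaxy. Description: Astronomers report new observations."],
    ["role" => "assistant", "content" => "Science"],
  ],
];
echo json_encode($example) . "\n"; // append one such line per example to the .jsonl file
?>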

1 Like

The introduction of the seed parameter should help to significantly reduce this though. More info in the cookbook here
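
In practice, that means sending a fixed seed in the request body and comparing the system_fingerprint that comes back; a rough sketch (the values here are arbitrary):

<?php
// Same seed and temperature on every call; per the cookbook, outputs are only
// expected to repeat while the returned system_fingerprint stays identical.
$body = [
  "model"       => "gpt-4-1106-preview",
  "messages"    => [["role" => "user", "content" => "Classify this document ..."]],
  "seed"        => 12345,
  "temperature" => 0,
];
// In the decoded response, log $json["system_fingerprint"] and compare it across runs.
?>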

2 Likes

thank you guys! that’s exactly what I was after. I’ll try it now.

1 Like

Tried it, and I get the same issue even when using the same seed every time.
Note:

  • I’m using model gpt-4-1106-preview and functions to format the output. …
  • I’m using REST API v1

The output is different even though system_fingerprint comes back with the same value.

It would be great to implement the seed in the playground UI so we can test and provide logs to the OpenAI dev team more easily.

Btw, from the cookbook: "If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical."

=> I have no idea what “mostly” identical means.

Probably 99.9%. I have used seeds in Midjourney for a while and almost every pixel is identical except a few. So you should expect almost every character in the response to be identical in that case.

Sorry to hear that :\

Fixing the seed and temperature should keep the output constant about 99.99% of the time (the number is anecdotal). For what it’s worth, gpt-4-1106 is still a preview model.

Are you using tool calling or json mode?

1 Like

here’s my code (sorry it is PHP ;)):

<?php

$debug=true;
function log_msg($msg)
{
  global $debug;
  if ($debug){
    $fp = @fopen("./requests.txt","a");
    if ($fp){
      // only write when the log file could actually be opened
      fwrite($fp,get_timestamp().$msg."\n");
      fclose($fp);
    }
  }
}
function get_timestamp()
{
  $timestamp = "[".date("y/m/d-h:i:s")."]:";
  return $timestamp;	
}
$functions = [
  [
    "name" => "formatAsObject",
    "description" => "format the list of tags proposed as an array",
    "parameters" => [
      "type" => "object",
      "properties" => [
        "tagsArray" => [
          "type" => "array",
          "items" => [
            "type" => "string",
            "description" => "A tag"
          ],
          "description" => "The array of tags"
        ],
        "broadCategory" => [
          "type" => "string",
          "description" => "The broad category of the document"
        ],
        "specificCategory" => [
          "type" => "string",
          "description" => "The slightly more specific category of the document"
        ]
      ],
      "required" => ["tagsArray","category"]
    ]
  ]
];
function generatePrompt($title,$description)
{
  return "give 3 english tags (1 word long maximum ) for document titled ".$title.".Description of the document is \"".$description."\".If a first and a last name are quoted, combine them in a fourth tag with a space in between. Minimize tag name complexity and use a sober style without hyperbolic expressions.".
  "Provide a broad and a slightly more specific category where would this document fall in. You can create new categories to be translated in english if document does not fit into any. Use most popular term for the categories. 
  ";
}
$key="sk-------------------------";
$title=$_POST["title"];
$description=$_POST["description"];
$GPT_MODEL = "gpt-4-1106-preview";

$url="https://api.openai.com/v1/chat/completions";
$headers = [
  'Authorization: Bearer ' . $key,
  'Content-Type: application/json',
];
$data = array(
  "model" => $GPT_MODEL,
  "messages" => array(
      array("role" => "user", "content" => generatePrompt($title,$description)
      )
  ),
  "max_tokens" => 50,
  "seed" => 12345,
  "temperature" => 0,
  "top_p" => 1,
  "frequency_penalty" => 0,
  "presence_penalty" => 0,
  "functions" => $functions,
  "function_call" => array("name" => "formatAsObject")
);
log_msg(generatePrompt($title,$description));
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($data));
// Execute the request and capture the response
$response = curl_exec($ch);
if (curl_errno($ch)) {
  // error_get_last() and $http_response_header are not populated by cURL;
  // use the cURL-specific helpers to report what went wrong
  echo "Error: " . curl_error($ch) . '<br/>';
  echo "HTTP status: " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . '<br/>';

} else {
  $jsonData = json_decode($response, true);
  $functionName = $jsonData["choices"][0]["message"]["function_call"]["name"];

  if ($functionName == "formatAsObject"){
    // Log the JSON arguments returned by the forced function call
    log_msg($jsonData["choices"][0]["message"]["function_call"]["arguments"]);
  }

  // Echo the full API response either way, so $result is always defined
  $result = $response;
  
  header('Content-Type: application/json');
  echo $result;
}
curl_close($ch);
?>

You have a single user message without a system role message being passed first.

Let’s make your original message clear, isolating single sentences:

User message, using “title” and “description”

  • Give 3 english tags (1 word long maximum) for document titled {title}.
  • Description of the document is {description}.
  • If a first and a last name are quoted, combine them in a fourth tag with a space in between.
  • Minimize tag name complexity and use a sober style without hyperbolic expressions.
  • Provide a broad and a slightly more specific category where would this document fall in.
  • You can create new categories to be translated in english if document does not fit into any.
  • Use most popular term for the categories.

And then what kind of output do you want, trying to output by function?

Desired output, json-like

tagsArray[] - The array of “tags”
broadCategory - The broad category of the document
specificCategory - The slightly more specific category of the document

I quickly note that your required parameters don’t match the included parameters.

The ambiguity of the instructions, and the mixing of text within the data, will be the primary challenge to understanding and consistency.

I would place the identity and purpose of the AI in a system message, along with any required output formats or other behavior.

Then structure the user message like:

{title}

{document}


description of task

Using a function is not the ideal way to get the output you desire; the output format can simply be instructed.
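
A sketch of that structure as a messages array (the exact wording of the system message is illustrative, reusing $title and $description from the script above):

<?php
// System message: identity, rules, and the required JSON output format.
// User message: the data first, then a short description of the task.
$messages = [
  ["role" => "system", "content" =>
    "You are a document tagger and classifier. " .
    "Respond only with JSON: {\"tagsArray\": [], \"broadCategory\": \"\", \"specificCategory\": \"\"}."],
  ["role" => "user", "content" =>
    $title . "\n\n" . $description . "\n\n" .
    "Give 3 English tags (1 word maximum each), a broad category, and a slightly more specific category."],
];
?>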

Thanks @_j, I’ll reformat my prompt messages. Regarding functions, I thought one of their main uses was precisely output formatting… But since writing the script, I have seen example prompts asking explicitly for JSON output.
Thanks also for spotting the naming inconsistencies in the function.

Btw, an afterthought: the debate is drifting a bit, so I’ll make some terminology adjustments here to make sure we’re all on the same page:

  • similar = close
  • identical = same = exactly equal

I understand that proper prompt engineering will help in getting similar answers in the case of similar or close inputs.
The initial intent of the post was to get identical results based on identical inputs.
I’m still not there after specifying the seed…

The seed will only help you if you have the same input - but if you keep getting exactly the same input, what do you need the LLM for?

What I meant with embeddings earlier is that you go from your “discrete” world to the “analog” world - where you can compare the distances between concepts.

There are different types of embeddings, ranging from word embeddings (basically finding synonyms) to large text embeddings (ada-002 (RIP davinci)).

You transform your words or texts into coordinates in a high-dimensional space, and then compare how close two of them are, either in terms of Euclidean distance or angle.

That way you can compare non-identical results.
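
A rough PHP sketch of that comparison, using the embeddings endpoint and cosine similarity ($key is the API key from the script above; the 0.9 threshold is arbitrary):

<?php
// Fetch an embedding vector for a piece of text from the embeddings endpoint.
function get_embedding($text, $apiKey) {
  $ch = curl_init("https://api.openai.com/v1/embeddings");
  curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => ["Authorization: Bearer " . $apiKey, "Content-Type: application/json"],
    CURLOPT_POSTFIELDS => json_encode(["model" => "text-embedding-ada-002", "input" => $text]),
  ]);
  $json = json_decode(curl_exec($ch), true);
  curl_close($ch);
  return $json["data"][0]["embedding"];
}

// Cosine similarity between two equal-length vectors: 1.0 means same direction.
function cosine_similarity($a, $b) {
  $dot = 0.0; $na = 0.0; $nb = 0.0;
  foreach ($a as $i => $v) {
    $dot += $v * $b[$i];
    $na  += $v * $v;
    $nb  += $b[$i] * $b[$i];
  }
  return $dot / (sqrt($na) * sqrt($nb));
}

// Treat two category strings as "the same" if their embeddings are close enough.
$sim = cosine_similarity(get_embedding("finance", $key), get_embedding("economics", $key));
if ($sim > 0.9) { /* consider them the same category */ }
?>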

I’m really thankful for all the input regarding what I consider step 2, i.e. getting consistency with similar inputs.
But the scope of this post is consistency with identical inputs (I just narrowed down the post title). And despite setting a seed, I still see different outputs coming back, API call after API call.

@Diet, I’m building a product in which users archive their web pages. Each archived page is automatically classified based on its content, and the category is then used to easily retrieve a given archive.
A user can archive multiple temporal versions of the same page. If the computed category changes even though the page content does not, this will be extremely confusing.
And I’m not even talking about testing and the other issues a dev team runs into when reproducibility is not guaranteed…

A thought:

You could consider giving the model the previous categories (and maybe the delta) and asking it whether things should be added to or removed from the categories.

I personally haven’t had amazing results with removing stuff from lists, but the last time I tried something like that was in May, I think.

It’s possible that removal works more reliably in two passes (ask the model to generate, present it with old categories, and then ask it to compare), but I haven’t tried that.

If you wanna harmonize your categories across documents, you might want to try a similar approach: generate, present preferred results from WordNet or something, and then compare/confirm.

In any case, you’re gonna have to iteratively work on your prompt to get good results.
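
One possible shape for that two-pass flow, sketched as message arrays (everything here, including $previousCategories and $proposedCategories, is hypothetical):

<?php
// Pass 1: ask the model to propose categories for the new version of the page.
$firstPass = [
  ["role" => "system", "content" => "Propose a broad and a specific category for the document. Respond as JSON."],
  ["role" => "user", "content" => $title . "\n\n" . $description],
];

// Pass 2: show the previously stored categories next to the new proposal and
// ask the model to confirm or adjust, so unchanged content keeps its labels.
$secondPass = [
  ["role" => "system", "content" => "You reconcile proposed categories with existing ones. Keep the existing categories unless the new proposal is clearly better. Respond as JSON."],
  ["role" => "user", "content" =>
    "Existing categories: " . json_encode($previousCategories) . "\n" .
    "New proposal: " . json_encode($proposedCategories)],
];
?>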

You can use the best settings currently available to get the most similar generation:

"seed": (same number),
"temperature": 0.00000000000001,
"top_p": 0.000000001,

The result will still have a chance of deviating, from top-token flips, when there are probabilities with very similar values. The internals of the model still have jitter in their calculations.

The enhancement, then, is to make what the AI receives exceedingly clear and instructed, so that the top token for the AI to produce at each token position is easily distinguished because the answer to be provided is understood.

The alternative is to make it exceedingly confusing, so the only thing the AI can say is pretrained “I’m sorry, but I couldn’t understand your message. Please provide a clear and specific request or prompt, and I’ll be happy to assist you.” — also with high certainty.


Example: Clear instructions can make even a silly question never flip to the second place token, when using a low top_p parameter.

(screenshot)

(note, with default sampling parameters, the AI would appear to not follow the instructions at least 1.14% of the time by producing 666 anyway…)

Bonus - humans are more decisive than AI

(screenshots)

Hello @_j, do you have any pointer to OpenAI documentation clarifying how the seed is used during inference? A seed is normally there to dictate determinism in the sampling process. What you’re writing tends to indicate that the seed does not make every stage of the model’s computations deterministic.