Non-deterministic API/Playground GPT-4 responses break LangChain ReAct implementation

I’ve been working on an app that relies on a LangChain ReAct agent (structured chat agent). It collects user data, queries a VectorDB, and retrieves info to ground its answers (RAG). It seems to work fine, but sometimes the agent doesn’t properly recognize the retriever tool action or return the requested format.
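For context, the agent wiring looks roughly like this (a minimal sketch using LangChain’s structured chat agent; the retriever function below is a placeholder standing in for the actual VectorDB lookup):

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-4", max_tokens=2000)

# Placeholder for the real VectorDB retriever behind the RAG step
def search_movies(query: str) -> str:
    return "...retrieved movie overviews..."

tools = [Tool(name="Search movies", func=search_movies,
              description="Movie search tool. The action input must be just topics in a natural language sentence")]

agent = initialize_agent(tools, llm,
                         agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
                         verbose=True)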

I dug into the LangChain internals to figure out what was happening. I reached the final OpenAI Python API call without noticing anything suspicious. Then I checked the API call manually in the Playground and made some test calls. Here’s what I found:

  1. When ‘gpt-4’ is called through the Python API, it internally selects the latest snapshot for the call (i.e. gpt-4-0613 at the moment). This is partially covered in the documentation as well as in the forum. The consensus seems to be that ‘gpt-4’ and ‘gpt-4-0613’ are the same (though that is hard to verify; see the snippet after this list).

  2. With the same prompts, temperature = 0 and top_p = 0.01 (the Playground doesn’t allow setting it to 0), the behavior is still not deterministic.
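As a side note, you can check which snapshot actually served a request by reading the model field of the response (a quick sketch using the pre-1.0 openai client used throughout this post):

import openai

resp = openai.ChatCompletion.create(
    model="gpt-4",        # alias
    temperature=0,
    top_p=0,
    max_tokens=16,
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp["model"])      # e.g. "gpt-4-0613": the snapshot the alias resolved to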

Here I share a working and a failing example with all the parameters and prompts exactly equal:

  1. Failing example
  2. Working example

The prompts are exactly as LangChain passes them to the OpenAI API call.

You can easily reproduce this in Python:

from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0, model="gpt-4", max_tokens=2000)

message_dicts = [{'role': 'system',
   'content': ' You are an assistant expert in movies recommendation. You should guide and help the user through the whole process until suggesting the best movie options to watch. You should attend to all the user requirements always taking into account user data. \n\nFor the first interactions you should collect some user configuration data. This data will restrict the movies to consider.\n\nUser Data to collect (mandatory):\n    Target genre: Movie genre e.g. fiction, adventure, trhiller...\n    Movie overview topic: List of keywords defining the campaign context.\n\nAfter succesfully collecting data, you should keep the conversation with the human, answering the questions and requests as good as you can. To do so, you have access to the following tools:\n\nSearch movies: Movie search tool. The action input must be just topics in a natural language sentence, args: {{\'tool_input\': {{\'type\': \'string\'}}}}\nCalculator: Useful for when you need to answer questions about math., args: {{\'tool_input\': {{\'type\': \'string\'}}}}\n\nUse a json blob to specify a tool by providing an action key (tool name) and an action_input key (tool input).\n\nValid "action" values: "Final Answer" or Search movies, Calculator\n\nProvide only ONE action per $JSON_BLOB, as shown:\n\n```\n{\n  "action": $TOOL_NAME,\n  "action_input": $INPUT\n}\n```\n\nFollow this format:\n\nQuestion: input question to answer\nThought: consider previous and subsequent steps\nAction:\n```\n$JSON_BLOB\n```\nObservation: action result\n... (repeat Thought/Action/Observation N times)\nThought: I know what to respond\nAction:\n```\n{\n  "action": "Final Answer",\n  "action_input": "Final response to human"\n}\n```\n\nBegin! Your first action must ve collect user data and keep it. Reminder to ALWAYS respond with a valid json blob of a single action. Use tools if necessary. Respond directly if appropriate. Format is Action:```$JSON_BLOB``` then Observation: ... Thought: ... Action: ...'},
  {'role': 'user',
   'content': 'I want to watch a movie about outer space exploration.'},
  {'role': 'assistant',
   'content': 'Sure, I can help with that. Could you please specify your preferred genre? For example, are you interested in science fiction, adventure, drama, or something else?'},
  {'role': 'user', 'content': 'Thriller.\n\n'}]

params = {'model': 'gpt-4',
  'request_timeout': None,
  'max_tokens': 2000,
  'top_p': 0,
  'stream': False,
  'n': 1,
  'temperature': 0,
  'api_key': 'yourapikeyhere',
  'api_base': '',
  'organization': '',
  'stop': ['Observation:']}

response = llm.completion_with_retry(messages=message_dicts, **params)

If you just execute the last call again and again, the output seems to be consistent. But if you re-run the whole snippet (re-creating the client), the response will eventually change.
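To make the drift easier to catch, you can rebuild the wrapper on every iteration and count distinct completions (a rough sketch reusing message_dicts and params from above, assuming the pre-1.0 response format):

seen = set()
for i in range(10):
    llm = ChatOpenAI(temperature=0, model="gpt-4", max_tokens=2000)  # fresh setup each time
    response = llm.completion_with_retry(messages=message_dicts, **params)
    seen.add(response["choices"][0]["message"]["content"])
    print(f"run {i}: {len(seen)} distinct completion(s) so far")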

TL;DR
GPT-4 returns inconsistent responses when called through the LangChain ReAct-with-tools implementation (structured chat).

Could any of you shed some light on this topic to help me better understand what’s happening and how to solve it (if that’s even possible)?

Hi,

The “gpt-4” model name is just an alias for whatever the latest model version is, i.e. -0613.

You can get a top_p of 0 by typing 0 into the box rather than just using the slider bar, and it gives this result

Hi foxabilo,

If you set top_p to 0 and try to save the Playground preset in order to share it, the UI will automatically set it back to 1. I don’t know if it’s a UI bug or a feature to prevent users from setting temperature and top_p to 0 (the docs suggest not doing that).

Anyway, even with top_p = 0, after a few page reloads you’ll get the wrong response. I can share the wrongly answered API calls with top_p = 0 later. I know, it’s messy having to reload and retry to reproduce the error.

GPT-4, likely being a mixture of expert models and a synthesis of their results, acts a bit differently than the gpt-3.x models. One interpretation could be that temperature is scaled differently within specializations, or that there are components that still include multinomial selection.
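Just to illustrate what multinomial selection means here (a toy numpy sketch of temperature scaling plus top_p truncation, not OpenAI’s actual decoder):

import numpy as np

def sample_token(logits, temperature=1.0, top_p=1.0, rng=np.random.default_rng()):
    # Degenerate settings collapse to greedy (deterministic) decoding
    if temperature == 0 or top_p == 0:
        return int(np.argmax(logits))
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most likely tokens first
    keep = order[np.cumsum(probs[order]) <= top_p]   # nucleus: high-probability prefix
    keep = order[:max(len(keep), 1)]                 # always keep at least the top token
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))                # multinomial draw among the survivors

Per-token randomness of this kind, accumulated over a long ReAct trace, can be enough to turn a well-formed JSON blob into one the agent parser rejects.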

The assertion that a tool is broken even at very low temperature is specious.

Let’s play with probabilities:

params = {"model": "gpt-4", "max_tokens": 1, "n": 60,
    "temperature": 0.2, "top_p": 0.99,
    "messages": [{"role": "system",
    "content": """Allowed output: only one word, a random choice of 'heads' or 'tails'.
Flip a virtual coin with equal outcome probability."""}]}
api = openai.ChatCompletion.create(**params)
flips = ''.join(choice["message"]["content"][0] for choice in api["choices"])
print(flips)


Sixty identical runs of gpt-4, on a prompt whose answer is far more uncertain than how to write code.

We see the h=heads t=tails results:
ttttttttttttttttttthtttttttttttthtttthtththttttttttttttthttt

(it is actually quite hard to prompt equal probabilities without receiving back token logits)


I’m going to crank temperature up to 2, but set top_p to 0:
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt

Deterministic coin flips.
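For reference, the only change from the snippet above was the sampling parameters, e.g.:

params.update({"temperature": 2, "top_p": 0})   # top_p = 0 effectively keeps only the top token
api = openai.ChatCompletion.create(**params)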


That’s really helpful. I’ll iterate a bit on trying to get a deterministic output. Anyway, it seems like I’m focusing on a corner case, and even if I get it under control it won’t solve the general LangChain case.

I’ll let you know tomorrow when I have some time to try.