I’ve been working on an app that relies on a LangChain ReAct agent (structured chat agent). It collects user data, queries a VectorDB, and retrieves info to ground its answers (RAG). It mostly works fine, but sometimes the agent fails to recognize the retriever tool action or doesn’t return the requested format.
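For context, the agent is wired up roughly like this. This is a minimal sketch, not my actual code: it assumes the legacy `initialize_agent` API, and `movie_db` is a hypothetical, already-populated vector store standing in for my real VectorDB:

```python
from langchain.agents import AgentType, initialize_agent
from langchain.chat_models import ChatOpenAI
from langchain.tools import Tool

# Assumption: movie_db is an already-populated vector store (Chroma, FAISS, ...)
retriever = movie_db.as_retriever()

search_movies = Tool(
    name="Search movies",
    description="Movie search tool. The action input must be just topics in a natural language sentence",
    func=lambda q: "\n\n".join(d.page_content for d in retriever.get_relevant_documents(q)),
)

llm = ChatOpenAI(temperature=0, model="gpt-4", max_tokens=2000)

# STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION produces the ReAct JSON-blob
# prompt format visible in the system message further down
agent = initialize_agent(
    tools=[search_movies],
    llm=llm,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)
```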
I dug into the LangChain internals to figure out what was happening. I traced everything down to the final OpenAI Python API call without noticing anything suspicious. Then I checked the API call manually in the playground and made some test calls. Here’s what I found:
- When ‘gpt-4’ is called through the Python API, it internally resolves to the latest snapshot (i.e. gpt-4-0613 at the moment). This is partially covered in the documentation as well as in the forum. The consensus seems to be that ‘gpt-4’ and ‘gpt-4-0613’ are the same (though this is hard to verify).
- Identical prompts with temperature = 0 and top_p = 0.01 (the playground doesn’t allow setting it to 0) do not guarantee deterministic behavior (see the sketch below).
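To separate the two issues, you can pin the snapshot explicitly and call the endpoint in a loop to check determinism. A quick sketch, assuming the legacy openai<1.0 SDK (the message list here is a made-up stand-in):

```python
import openai

messages = [{"role": "user", "content": "Recommend a space-exploration thriller."}]

outputs = set()
for _ in range(10):
    resp = openai.ChatCompletion.create(
        model="gpt-4-0613",   # pin the snapshot instead of the "gpt-4" alias
        messages=messages,
        temperature=0,
        top_p=0,              # the Python API, unlike the playground, accepts 0
        max_tokens=200,
    )
    outputs.add(resp["choices"][0]["message"]["content"])

print(f"{len(outputs)} distinct completion(s) out of 10 calls")
```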
Here I share a working and a failing example with all parameters and prompts exactly equal. The prompts are exactly as LangChain passes them to the OpenAI API call.
You can easily reproduce the same in Python:
```python
# Note: this uses the legacy stacks (langchain<0.1, openai<1.0), where
# ChatOpenAI.completion_with_retry forwards its kwargs to openai.ChatCompletion.create
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0, model="gpt-4", max_tokens=2000)
message_dicts = [{'role': 'system',
'content': ' You are an assistant expert in movies recommendation. You should guide and help the user through the whole process until suggesting the best movie options to watch. You should attend to all the user requirements always taking into account user data. \n\nFor the first interactions you should collect some user configuration data. This data will restrict the movies to consider.\n\nUser Data to collect (mandatory):\n Target genre: Movie genre e.g. fiction, adventure, trhiller...\n Movie overview topic: List of keywords defining the campaign context.\n\nAfter succesfully collecting data, you should keep the conversation with the human, answering the questions and requests as good as you can. To do so, you have access to the following tools:\n\nSearch movies: Movie search tool. The action input must be just topics in a natural language sentence, args: {{\'tool_input\': {{\'type\': \'string\'}}}}\nCalculator: Useful for when you need to answer questions about math., args: {{\'tool_input\': {{\'type\': \'string\'}}}}\n\nUse a json blob to specify a tool by providing an action key (tool name) and an action_input key (tool input).\n\nValid "action" values: "Final Answer" or Search movies, Calculator\n\nProvide only ONE action per $JSON_BLOB, as shown:\n\n```\n{\n "action": $TOOL_NAME,\n "action_input": $INPUT\n}\n```\n\nFollow this format:\n\nQuestion: input question to answer\nThought: consider previous and subsequent steps\nAction:\n```\n$JSON_BLOB\n```\nObservation: action result\n... (repeat Thought/Action/Observation N times)\nThought: I know what to respond\nAction:\n```\n{\n "action": "Final Answer",\n "action_input": "Final response to human"\n}\n```\n\nBegin! Your first action must ve collect user data and keep it. Reminder to ALWAYS respond with a valid json blob of a single action. Use tools if necessary. Respond directly if appropriate. Format is Action:```$JSON_BLOB``` then Observation: ... Thought: ... Action: ...'},
{'role': 'user',
'content': 'I want to watch a movie about outer space exploration.'},
{'role': 'assistant',
'content': 'Sure, I can help with that. Could you please specify your preferred genre? For example, are you interested in science fiction, adventure, drama, or something else?'},
{'role': 'user', 'content': 'Thriller.\n\n'}]
params = {'model': 'gpt-4',
'request_timeout': None,
'max_tokens': 2000,
'top_p': 0,
'stream': False,
'n': 1,
'temperature': 0,
'api_key': 'yourapikeyhere',
'api_base': '',
'organization': '',
'stop': ['Observation:']}
response = llm.completion_with_retry(messages=message_dicts, **params)
```
If you just execute that last call again and again, the output seems consistent. But if you re-run the whole script (re-initializing everything), the response will eventually change.
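A sketch of what I mean (same legacy-API assumptions, with `message_dicts` and `params` as defined above): re-instantiating the client on every iteration, as a fresh script run would, eventually yields different completions:

```python
from langchain.chat_models import ChatOpenAI

seen = set()
for _ in range(10):
    # Re-instantiate the client each time, mimicking a fresh script run
    llm = ChatOpenAI(temperature=0, model="gpt-4", max_tokens=2000)
    response = llm.completion_with_retry(messages=message_dicts, **params)
    seen.add(response["choices"][0]["message"]["content"])

print(f"{len(seen)} distinct response(s) across 10 fresh runs")
```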
TL;DR
GPT-4 returns inconsistent responses when called through the LangChain ReAct with-tools implementation (structured chat), even at temperature 0.
Could any of you shed some light on this topic, to help me better understand what’s happening and how to solve it (if that’s even possible)?