We are building an AI tool that allows querying data from an SQL database. LangChain is used as the foundation for instructing the models, maintaining chat history, and integrating with different LLM providers/APIs.
The main idea: the user writes a natural-language question about the data, possibly including dates and other filters. Using an LLM, the goal is to have the model understand the user’s intent, construct the corresponding SQL query, obtain the results, and return a response.
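For context, the setup is roughly the following minimal sketch (assuming the langchain, langchain-openai, and langchain-community packages; the deployment name and SQLite database are placeholders, and our actual configuration differs in detail):

```python
from langchain_openai import AzureChatOpenAI
from langchain_community.utilities import SQLDatabase
from langchain.chains import create_sql_query_chain

# Connect to the database; LangChain introspects the schema for the prompt.
db = SQLDatabase.from_uri("sqlite:///sales.db")  # placeholder database

# Credentials come from AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY env vars.
llm = AzureChatOpenAI(
    azure_deployment="gpt-4-turbo",  # placeholder deployment name
    api_version="2024-02-01",
    temperature=0,
)

# Turn a natural-language question into a SQL query for the given database.
chain = create_sql_query_chain(llm, db)
sql = chain.invoke({"question": "What was the total revenue in March 2024?"})
print(sql)  # the generated SQL, executed separately against the database
```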
The instructions given to the LLM were designed to specify which type of database the query should target, how to resolve dates from the user’s input, the database structure, which limits should be applied to the queries, the default currency for monetary calculations, etc. All of these instructions aim to guide the model in designing the corresponding SQL query.
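To illustrate, here is a hedged sketch of such a prompt; the rules below are placeholders covering the categories just listed, not our production prompt. If I recall correctly, a custom prompt passed to `create_sql_query_chain` must expose the `input`, `table_info`, and `top_k` variables, while static values can be filled in with `partial`:

```python
from datetime import date
from langchain_core.prompts import PromptTemplate

SQL_PROMPT = PromptTemplate.from_template(
    """You are an expert {dialect} analyst. Write a single {dialect} query
that answers the user's question.

Rules:
- Use only the tables and columns described below:
{table_info}
- Resolve relative dates ("last month", "yesterday") against {current_date}.
- Never return more than {top_k} rows; add a LIMIT clause if needed.
- Report monetary amounts in EUR unless the user asks otherwise.

Question: {input}
SQL query:"""
)

# Fill the static variables; the chain supplies input, table_info and top_k.
prompt = SQL_PROMPT.partial(dialect="SQLite", current_date=date.today().isoformat())
# chain = create_sql_query_chain(llm, db, prompt=prompt)  # as in the sketch above
```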
That being said, the tests were conducted successfully on Azure OpenAI using the “GPT-4-turbo-1106-preview” model. This model is about to be deprecated/replaced by “GPT-4-turbo-2024-04-09”, and at the same time Azure has made “GPT-4o”, the new OpenAI model, available.
When testing with “GPT-4-turbo-2024-04-09” and “GPT-4o”, everything that had previously worked stopped functioning: incoherent responses, poor comprehension of the instructions (or instructions simply being ignored), invalid queries, incorrect results, and so on.
We would expect automatic model updates on Azure (as is the case with “GPT-4-Turbo”) to be more stable, given the short lifespan of each version, which is around six months.
So…
- Is it normal that changing from one OpenAI model to another results in such a significant difference in comprehension and outcomes?
- Does any model change necessarily mean the entire set of instructions needs to be refactored?
- Should we be doing things differently?
- Are OpenAI models suitable for building such a tool?