Issues and training when updating the LLM model on a project

We are building an AI tool that allows querying data from an SQL database. Langchain is used as the foundation for training models, maintaining chat history, and integrating with different LLM providers/APIs.

The main idea is based on the user writing a natural language question about the data, considering dates and other types of filters. Using an LLM, the goal is to have the model understand the user’s intent and construct the corresponding SQL query to obtain results and return a response.

The LLM training instructions were designed to specify which type of database the query should be built for, what date to consider based on the user’s input, the database structure, what limits should be applied to the queries, the default currency for monetary calculations, etc. All instructions aim to assist in designing the corresponding SQL query.

That being said, the tests were successfully conducted on Azure OpenAI, using the “GPT-4-turbo-1106-preview” model. This model is about to be deprecated/replaced by “GPT-4-turbo-2204-04-09” and, at the same time, Azure has made “GPT-4o”, the new OpenAI model, available.

When testing with “GPT-4-turbo-2204-04-09” and “GPT-4o”, everything that had previously worked stopped functioning. Incoherent responses, poor comprehension of instructions, or simply disregarded instructions. Invalid queries, incorrect results, etc.

We would expect that automatic model updates (such as the case with “GPT-4-Turbo”) on Azure to be more stable, given the short duration of each version, which is around 6 months.

So…

  • Is it normal that changing from one OpenAI model to another results in such a significant difference in comprehension and outcomes?
  • Does any model change necessarily mean the entire set of instructions needs to be refactored?
  • Should we be doing things differently?
  • Are OpenAI models suitable for building such a tool?
1 Like

Yes, but it’s usually not worse

Yes

Probably. You have used a bunch of bootstrapped black boxes together and now are going to have a hell of a time debugging it

Yes

2 Likes

So, the issue is in the training prompts?

Should we avoid using any library and code everything using just the API?

In my opinion, with technology advancing rapidly, it’s essential to have a clear, simple, grounded, and understandable process. This ensures that when something goes wrong, like a prompt not working, you can easily identify and address the root cause.

Libraries like LangChain often abstract away too much of the fundamentals behind RAG, making them difficult to modify & fully understand.

It’s not essential but maybe just a thought down the line. It would most likely lead to cheaper, faster, more maintainable code

1 Like