Help Needed: Refactoring SQL Agent Code for Schema Validation in Multi-Agent Workflow

Hi everyone,
I’m working on a project that involves a multi-agent workflow where agents are executed sequentially. The SQL agent is the first in the chain, responsible for generating SQL queries based on user input.

Here’s the issue I’m encountering:

Even if the input query is unrelated to the provided database schema, the SQL agent still generates SQL.
I’d like to refactor the code to dynamically validate the query against the schema and, if unrelated, return a helpful suggestion that indicates the issue and provides guidance about the schema.
My goal is to ensure the SQL agent handles unrelated queries gracefully and sets the stage properly for subsequent agents in the workflow.

I’m sharing the relevant sections of my code for context (SQL agent and task sections). Any insights, best practices, or examples for improving schema validation in a multi-agent workflow would be greatly appreciated!

sql_dev = Agent(
role="SQL Query Generator",
goal=f"Generate multiple optimal SQL queries for {self.databaseType} based on user requirements and schema, ensuring strict field-to-table mapping",
backstory=f"""You are an expert SQL query generator focused on creating multiple optimal SQL queries for {self.databaseType} databases.
The database schema is given to you for reference:
Database Schema: {json.dumps(self.schema_info, indent=2)}

        Your responsibility is to:
        1. Use the list_tables tool (without parameters) to see available tables.
        2. Use the tables_schema tool with the table names to understand their structure.                   
        3. **CRITICAL: Field-to-Table Mapping Rules:**
           - Each field MUST be referenced using its own table's alias
           - NEVER use a field with an incorrect table alias
           - All referenced fields MUST exist in their respective tables
           - Example: If 'name' exists in 'customer' table:
             * CORRECT: c.name (where c is customer alias)
             * WRONG: e.name (where e is employee alias)
           - For fields that exist in multiple tables:
             * Always use the correct table based on the context
             * Verify field existence using tables_schema before use
          
        4. **Query Generation Rules:**
            - Only join tables with valid relationships
            - Before using any field, verify it exists in the tables mentioned in the schema
            - Check the schema_info to confirm field ownership
            - Use proper table aliases consistently
            - If {self.databaseType} is "mysql":
                - Follow pymysql LIMIT handling rules:
                    1. For subqueries with LIMIT, ALWAYS use derived tables with proper JOIN.                          
                    2. NEVER use LIMIT directly in subqueries with IN, ANY, ALL, or EXISTS operators
            - Include field validation in your thought process
           
       
        5. **Error Prevention:**
           - If a field doesn't exist in a table, return an error
           - If trying to use a field with wrong table alias, return an error
           - Include specific error messages about incorrect field usage
           
        6. **IMPORTANT: Query Formatting Rules:**
            - **IMPORTANT: Generate SQL queries in a single line without any special characters (\\n, \\, etc.) as delimiters between keywords.**
                Examples of correct formatting:
                - GOOD: "SELECT column1, column2 FROM table1 JOIN table2 ON table1.id = table2.id WHERE condition GROUP BY column1"
                - BAD: "SELECT column1, column2 \\n FROM table1 \\n JOIN table2 \\n ON table1.id = table2.id"
            - Ensure all table aliases and field references maintain proper spacing
          
        7. **Return a JSON object in this exact format, following a strict ID generation rule:**
                - First query MUST have sql_id starting with 'sqlid1'
                - Subsequent queries will continue this pattern (sqlid2, sqlid3, etc.)
                {{
                    "queries": [
                        {{
                            "sql_id": "unique_id starting with 'sqlid'",
                            "query": "SQL query string",
                            "description": "What this query does"
                        }}
                    ]
                }}""",

        llm=llm,
        tools=tools[:2]
    )

Task to generate SQL queries based on user input

        generate_sql = Task(
            description=f"""Generate SQL queries for this requirement: {user_input}
            
            IMPORTANT: You must follow these validation steps:
            1. First check the available tables from list_tables_task
            2. Only generate queries that reference tables that actually exist
            3. If required tables don't exist, return an error message instead of a query
            
            Steps:
            1. Get available tables from list_tables_task
            2. Validate that all tables needed for the query exist
            3. Use tables_schema tool with validated table names
            4. Generate appropriate SQL queries ONLY if all required tables exist
            5. If {self.databaseType} is "mysql":
                - Follow pymysql LIMIT handling rules:
                    1. For subqueries with LIMIT, ALWAYS use derived tables with proper JOIN.
                    2. NEVER use LIMIT directly in subqueries with IN, ANY, ALL, or EXISTS operators
            6. If required tables are missing, return an error message""",
            expected_output="""JSON object with SQL queries and IDs in format:
            {
                "queries": [
                    {
                        "sql_id": "unique_id starting with 'sqlid'",
                        "query": "SQL query string",
                        "description": "What this query does"
                    }
                ]
            }""",
            agent=sql_dev,
            context=[list_tables_task]
        )

[Note: I have removed the "error" key from the return format, since it was not giving the desired output.]

Thank you in advance for your help!

Give it an optional database table to fall back on when nothing else fits: a "blackhole" table where everything else is stored and whose structure is dynamically created.

at least that’s what you tell the model.

and then you check if the sql has the string blackhole in it…
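
Something like this (a sketch; the JSON shape is the one from your agent's prompt, and the hint text is a placeholder):

# Sketch of the "blackhole" check; assumes the agent returns the JSON
# format from your prompt. The hint text is just a placeholder.
import json

def route_sql_output(agent_output: str):
    result = json.loads(agent_output)
    for item in result.get("queries", []):
        if "blackhole" in item["query"].lower():
            # The model reached for the sentinel table: the request
            # doesn't fit the real schema, so return guidance instead.
            return {
                "error": "The request doesn't match the available tables.",
                "hint": "Describe what you want using the tables in the schema.",
            }
    return result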

Ah, and btw: the prompt looks overly complicated to me. There is something wrong with the db structure; you can smell it when you see the prompt…

It should work with 5 lines of prompt, or you need to change something.

Maybe you can ask for parts of the SQL separately and then put them together in code?
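
Rough idea of the assembly side (the part names and order here are just assumptions):

# Rough sketch of assembling the SQL in code from parts the model returns
# separately; the keys ("columns", "table", "joins", "where") are just an
# illustration, not a fixed format.
def assemble_sql(parts: dict) -> str:
    sql = f"SELECT {', '.join(parts['columns'])} FROM {parts['table']}"
    for join in parts.get("joins", []):
        sql += f" JOIN {join['table']} ON {join['on']}"
    if parts.get("where"):
        sql += f" WHERE {parts['where']}"
    return sql

# assemble_sql({"columns": ["c.name"], "table": "customer c", "where": "c.active = 1"})
# -> "SELECT c.name FROM customer c WHERE c.active = 1"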


Think of it like an ORM framework with an NLP touch…


Thank you for your response! I truly appreciate your input. I’m working on dashboard automation where I send a query (in natural language) and a datasource as input. Based on this, I need to dynamically generate both SQL and HTML code to create widgets.

I’m using four agents for this process: one to generate SQL, another to execute it, a third to analyze the results and suggest suitable widgets, and a fourth to generate HTML, JavaScript, and CSS code for rendering the widgets. Additionally, I need the generated SQL to be stored for further reference in another task.
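
For context, the crew is wired up roughly like this (simplified; apart from sql_dev and generate_sql shown above, the agent and task names are placeholders):

# Simplified wiring of the four-agent pipeline using CrewAI's sequential
# process. sql_dev / generate_sql are from the snippet above; the other
# agents and tasks are defined the same way and omitted here.
from crewai import Crew, Process

crew = Crew(
    agents=[sql_dev, executor_agent, widget_analyst, widget_coder],
    tasks=[generate_sql, execute_sql, suggest_widgets, generate_widget_code],
    process=Process.sequential,  # tasks run in order; earlier outputs are usable as context
)
result = crew.kickoff()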

I’d love to hear your thoughts on this approach and any suggestions you might have for improving or optimizing it.

Thanks again for your help! Looking forward to hearing more of your insights.


Sounds like you’ve got your plans all made. I’d give that a 75% success rate.

75 out of 100 devs can do that.

Are you good enough?

Ahh, btw @Diet this might be interesting for you


Thanks for the tag! I’ll definitely check it out.

My code works fine for generating multiple SQL queries and the corresponding widget code based on the schema. However, I’m facing an issue with error handling when the query is not aligned with the schema.

You can always split tasks.
e.g. with this prompt:


Find all tables relevant to the user's request.

user_data | stores data such as …
products | stores product data such as…
blackhole | stores everything else that is not stored in the other tables

I need them as table names in YAML, like this:
tables:
  user_data
  etc.

[userinput]

put the user's prompt here

JUST THE YAML! NOTHING ELSE!
START THE OUTPUT WITH 3 BACKTICKS AND END IT WITH 3 BACKTICKS!!!


Using capital letters and many exclamation marks seems to have a small effect.

Then grab the CREATE TABLE statements and tell it to make a join if necessary (you can count the tables and check if they have relations without asking the model). If blackhole is in the list, you’ve identified an error and you tell the user “please specify exactly what you mean”… And if that doesn’t work, you put an answer like “can’t find anything that matches your request”.
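
For the check itself, something roughly like this (assumes the model actually wrapped the YAML in 3 backticks as instructed; the empty-list fallback is my own shortcut):

# Rough sketch of handling the YAML table list from the prompt above.
import yaml

def pick_tables(model_output: str):
    # drop the backtick fence lines, keep the YAML body
    body = "\n".join(
        line for line in model_output.strip().splitlines()
        if not line.strip().startswith("```")
    )
    tables = (yaml.safe_load(body) or {}).get("tables") or []
    if isinstance(tables, str):  # the example above isn't a strict YAML list
        tables = tables.split()
    if "blackhole" in tables:
        return None, "Please specify exactly what you mean."
    if not tables:
        return None, "Can't find anything that matches your request."
    return tables, None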

and so on… You’ll figure it out eventually. If not just post what you have tried and we can point out what you can try instead.


Thanks for the suggestion!
In my scenario, I am receiving the datasource as a URL, and I am using the following imports from langchain_community.tools.sql_database.tool to interact with the SQL database:
InfoSQLDatabaseTool, ListSQLDatabaseTool, QuerySQLDataBaseTool.
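
They are wired up roughly like this (simplified; the connection URL is a placeholder):

# Simplified sketch of how the tools are created from the datasource URL
# (the URL itself is a placeholder).
from langchain_community.utilities import SQLDatabase
from langchain_community.tools.sql_database.tool import (
    InfoSQLDatabaseTool,
    ListSQLDatabaseTool,
    QuerySQLDataBaseTool,
)

db = SQLDatabase.from_uri("mysql+pymysql://user:password@host/dbname")
tools = [
    ListSQLDatabaseTool(db=db),   # list_tables
    InfoSQLDatabaseTool(db=db),   # tables_schema
    QuerySQLDataBaseTool(db=db),  # query execution (used by the executor agent)
]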

As @jochenschultz mentioned, it’s important to give the model an “out”: you can give it a fictitious table it can reference, which you then intercept.

Alternatively, you can give a status parameter and an optional error message parameter to your function calls. Was it OK? If everything’s OK, you proceed with your flow. Is there an error? The model should add an error message that will then get displayed to the user.
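
As a rough sketch (all names here are made up), it could look like this:

# Rough sketch: every tool/function result carries a status plus an
# optional error message; all names are made up for illustration.
report_sql_result = {
    "name": "report_sql_result",
    "description": "Report generated SQL, or an error if the request does not match the schema.",
    "parameters": {
        "type": "object",
        "properties": {
            "status": {"type": "string", "enum": ["ok", "error"]},
            "queries": {"type": "array", "items": {"type": "string"}},
            "error_message": {
                "type": "string",
                "description": "Shown to the user when status is 'error'.",
            },
        },
        "required": ["status"],
    },
}

def handle_report(args: dict):
    if args["status"] == "error":
        # Surface the model's explanation instead of running anything.
        return {"display_to_user": args.get("error_message", "Request could not be handled.")}
    return {"queries": args.get("queries", [])}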

You then need to tweak (“engineer”) your prompt to inform the model how lenient it should be with user queries.

One pattern you could apply would be reflection: asking the (or a) model to evaluate whether the response is suitable for the query. You could give that reflector a number of instructions or examples of what you consider a pass and what not. This reflector can then generate your user-readable error message.
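
A sketch of such a reflector (the prompt wording and the pass/fail format are just one way to do it):

# Sketch of a reflection step: a second model call judges whether the
# generated SQL actually answers the user's request. "llm_call" stands in
# for whatever completion function you use; the verdict format is arbitrary.
import json

REFLECT_PROMPT = """You are reviewing a SQL generation step.
User request: {user_input}
Schema: {schema_info}
Generated SQL: {sql}

Does the SQL answer the request using only tables and fields from the schema?
Reply with JSON: {{"pass": true or false, "message": "user-readable explanation"}}"""

def reflect(llm_call, user_input, schema_info, sql):
    verdict = json.loads(llm_call(REFLECT_PROMPT.format(
        user_input=user_input, schema_info=schema_info, sql=sql)))
    # A failed reflection produces the user-facing error message.
    return None if verdict["pass"] else verdict["message"]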