Hi everyone,
I’m working on a project that involves a multi-agent workflow where multiple agents are executed sequentially. The SQL agent is the first one in the chain, responsible for generating SQL queries based on user inputs.
Here’s the issue I’m encountering:
Even if the input query is unrelated to the provided database schema, the SQL agent still generates SQL.
I’d like to refactor the code to dynamically validate the query against the schema and, if unrelated, return a helpful suggestion that indicates the issue and provides guidance about the schema.
My goal is to ensure the SQL agent handles unrelated queries gracefully and sets the stage properly for subsequent agents in the workflow.
I’m sharing the relevant sections of my code for context (SQL agent and task sections). Any insights, best practices, or examples for improving schema validation in a multi-agent workflow would be greatly appreciated!
sql_dev = Agent(
role=“SQL Query Generator”,
goal=f"Generate multiple optimal SQL queries for {self.databaseType} based on user requirements and schema, ensuring strict field-to-table mapping",
backstory=f"""You are an expert SQL query generator focused on creating multiple optimal SQL queries for {self.databaseType} databases.
The database schema is given to you for reference:
Database Schema: {json.dumps(self.schema_info, indent=2)}
Your responsibility is to:
1. Use the list_tables tool (without parameters) to see available tables.
2. Use the tables_schema tool with the table names to understand their structure.
3. **CRITICAL: Field-to-Table Mapping Rules:**
- Each field MUST be referenced using its own table's alias
- NEVER use a field with an incorrect table alias
- All referenced fields MUST exist in their respective tables
- Example: If 'name' exists in 'customer' table:
* CORRECT: c.name (where c is customer alias)
* WRONG: e.name (where e is employee alias)
- For fields that exist in multiple tables:
* Always use the correct table based on the context
* Verify field existence using tables_schema before use
4. **Query Generation Rules:**
- Only join tables with valid relationships
- Before using any field, verify it exists in the tables mentioned in the schema
- Check the schema_info to confirm field ownership
- Use proper table aliases consistently
- If {self.databaseType} is "mysql":
-Follow pymysql LIMIT handling rules:
1. For subqueries with LIMIT, ALWAYS use derived tables with proper JOIN.
2. NEVER use LIMIT directly in subqueries with IN, ANY, ALL, or EXISTS operators
- Include field validation in your thought process
5. **Error Prevention:**
- If a field doesn't exist in a table, return an error
- If trying to use a field with wrong table alias, return an error
- Include specific error messages about incorrect field usage
6. **IMPORTANT: Query Formatting Rules:**
- **IMPORTANT: Generate SQL queries in a single line without any special characters (\\n, \\, etc.) as delimiters between keywords.**
Examples of correct formatting:
- GOOD: "SELECT column1, column2 FROM table1 JOIN table2 ON table1.id = table2.id WHERE condition GROUP BY column1"
- BAD: "SELECT column1, column2 \\n FROM table1 \\n JOIN table2 \\n ON table1.id = table2.id"
- Ensure all table aliases and field references maintain proper spacing
7. **Return a JSON object in this exact format, following a strict ID generation rule:**
- First query MUST have sql_id starting with 'sqlid1'
- Subsequent queries will continue this pattern (sqlid2, sqlid3, etc.)
{{
"queries": [
{{
"sql_id": "unique_id starting with 'sqlid'",
"query": "SQL query string",
"description": "What this query does"
}}
]
}}""",
llm=llm,
tools=tools[:2]
)
Task to generate SQL queries based on user input
generate_sql = Task(
description=f"""Generate SQL queries for this requirement: {user_input}
IMPORTANT: You must follow these validation steps:
1. First check the available tables from list_tables_task
2. Only generate queries that reference tables that actually exist
3. If required tables don't exist, return an error message instead of a query
Steps:
1. Get available tables from list_tables_task
2. Validate that all tables needed for the query exist
3. Use tables_schema tool with validated table names
4. Generate appropriate SQL queries ONLY if all required tables exist
5. If {self.databaseType} is "mysql":
--Follow pymysql LIMIT handling rules:
1. For subqueries with LIMIT, ALWAYS use derived tables with proper JOIN.
2. NEVER use LIMIT directly in subqueries with IN, ANY, ALL, or EXISTS operators 2. NEVER use LIMIT directly in subqueries with IN, ANY, ALL, or EXISTS operators
6. If required tables are missing, return an error message""",
expected_output="""JSON object with SQL queries and IDs in format:
{
"queries": [
{
"sql_id": "unique_id starting with 'sqlid'",
"query": "SQL query string",
"description": "What this query does"
}
]
}""",
agent=sql_dev,
context=[list_tables_task]
)
[I have removed the “error”: from return, since it was not giving desired output.]
Thank you in advance for your help!