Hi everyone,
I’m working on a text-to-SQL solution for a data warehouse that contains around 20 tables. To streamline access, we’ve created about 10 consolidated views, allowing our users to query across different data segments efficiently. Here’s an overview of our current setup, our solution approach, and the main challenges we’re looking to tackle.
Current Solution Structure:
- Team-Specific Views and Flexibility: Our users come from various teams like sales and marketing. To cater to their specific needs, we assign each team an application ID. This lets us create tailored views on top of the same underlying tables, so each team has a customized perspective without duplicating data or impacting the original structure.
- Embedding and Retrieval Approach: We embed column names, table names, and descriptions in OpenSearch. Using small-3 embeddings, we leverage OpenSearch’s built-in k-nearest-neighbors (k-NN) retrieval to identify the top 5 relevant tables for each query (a minimal retrieval sketch follows this list). This has gotten us to an accuracy rate of about 70% so far.
- Accuracy Enhancements with Entity Recognition: To improve this 70% accuracy without increasing latency, we’re exploring the integration of named entity recognition (NER) and fine-tuning our SQL database to ensure efficient query processing and higher precision in retrieval.
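For reference, here is roughly what that retrieval step looks like in code. This is a minimal sketch, not our exact implementation: it assumes the opensearch-py client, a hypothetical index named `schema_embeddings` with a k-NN vector field `embedding`, and a placeholder `embed()` for the small-3 embedding call.

```python
from opensearchpy import OpenSearch

# Illustrative client setup; host and auth details are placeholders.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def embed(text: str) -> list:
    """Placeholder for the embedding call (a 'small-3' style embedding model)."""
    raise NotImplementedError

def top_tables(question: str, k: int = 5) -> list:
    """Return the k tables/views whose embedded metadata best matches the question."""
    response = client.search(
        index="schema_embeddings",  # hypothetical index of column/table/description embeddings
        body={
            "size": k,
            "query": {"knn": {"embedding": {"vector": embed(question), "k": k}}},
        },
    )
    return [hit["_source"]["table_name"] for hit in response["hits"]["hits"]]
```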
Key Challenges and Questions:
- Ambiguity Handling: We’re looking for a way to handle ambiguous user queries effectively. Ideally, our system would recognize unclear questions and prompt users for clarification, ensuring accurate SQL generation for complex or vague input.
- End-User Feedback Utilization: Our primary users are non-SQL-savvy and rely on our system to auto-generate SQL queries, though they can validate query results. We’ve provided a feedback option, but we’re unsure how best to leverage this input to systematically improve the SQL outputs. Any strategies to incorporate their feedback into a “good SQL” model would be highly useful.
- “Good SQL” Database Management: To manage reliable SQL generation, we’re considering building a structured “good SQL” database:
- User-Specific SQL: SQL marked as effective by individual users, though this has challenges, as user feedback can sometimes be inconsistent.
- Global Good SQL DB: A general repository of reliable SQL queries that can serve as a primary source for top-k matches when users query.
The ideal solution would allow us to first query a user’s SQL feedback before moving to the global “good SQL” database, helping to improve relevance. Any advice on managing these feedback databases effectively or implementing a robust matching mechanism for improved accuracy would be incredibly valuable.
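To make the intended lookup order concrete, here is a minimal sketch of the two-tier matching we have in mind. Everything here is illustrative: `search()` stands in for whatever retrieval backs each store, and the 0.85 threshold is an assumed cutoff, not a tuned value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SqlMatch:
    sql: str
    score: float  # similarity in [0, 1]

def search(store: str, question: str, user_id: Optional[str] = None) -> Optional[SqlMatch]:
    """Placeholder: query the named store (vector or keyword) for its best match."""
    raise NotImplementedError

def find_good_sql(question: str, user_id: str, threshold: float = 0.85) -> Optional[str]:
    # 1. Prefer SQL this specific user has previously marked as good.
    personal = search("user_feedback_sql", question, user_id=user_id)
    if personal and personal.score >= threshold:
        return personal.sql
    # 2. Fall back to the globally vetted "good SQL" repository.
    global_hit = search("global_good_sql", question)
    if global_hit and global_hit.score >= threshold:
        return global_hit.sql
    # 3. No trusted match: fall through to full SQL generation.
    return None
```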
If anyone has encountered similar issues in text-to-SQL solutions, especially in balancing accuracy improvements with latency constraints, I’d love to hear your insights!
Any help here would be a great help…hoping for some brainstorming on this topic, which is very important for text2sql.
You introduced three levels of data structures: (a) data warehouse tables, (b) consolidated views on top of the data warehouse tables, and (c) tailored views “on top of the same underlying tables”.
(i) It is not clear what “table names” refers to in this context: “We embed column names, table names, and descriptions in OpenSearch.” Since you are shielding the underlying tables, I’m not sure why you would need to put the table names and their relevant columns in OpenSearch.
Accuracy should be your first target. I define accuracy as the ability to return semantically correct SQL which, if run against an accessible database, produces results that are acceptable to the end user. Is that your definition of accuracy?
In my context, the ability to get to 95% accuracy (only 1 user query in 20 user queries returns results which are NOT acceptable to the end user) is pretty important.
In my context, what has worked is that only after getting to the relevant accuracy is it practical to bring the latency down. The essential method of reducing latency is to store relevant SQLs, with personalized recommendations, in a vector store. Of course there’s a lot more that goes on behind the scenes.
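For illustration only, a minimal sketch of that pattern as I understand it: questions a user has already asked (or close paraphrases) hit a small per-user vector cache and return stored SQL immediately, while everything else falls through to the slower generation pipeline. The class names and the 0.9 cutoff are my assumptions, not the exact implementation.

```python
from typing import Optional
import numpy as np

class SqlVectorCache:
    """Tiny in-memory stand-in for a per-user vector store of validated SQL."""

    def __init__(self):
        self.vectors = []  # normalized question embeddings
        self.sqls = []     # SQL stored alongside each embedding

    def add(self, question_vec: np.ndarray, sql: str) -> None:
        self.vectors.append(question_vec / np.linalg.norm(question_vec))
        self.sqls.append(sql)

    def lookup(self, question_vec: np.ndarray, threshold: float = 0.9) -> Optional[str]:
        if not self.vectors:
            return None
        q = question_vec / np.linalg.norm(question_vec)
        sims = np.stack(self.vectors) @ q  # cosine similarity vs. all cached questions
        best = int(np.argmax(sims))
        return self.sqls[best] if sims[best] >= threshold else None

def answer(question_vec: np.ndarray, cache: SqlVectorCache) -> str:
    cached = cache.lookup(question_vec)
    if cached is not None:
        return cached                      # fast path: no LLM round trip
    sql = run_full_pipeline(question_vec)  # accuracy-first path, see below
    cache.add(question_vec, sql)
    return sql

def run_full_pipeline(question_vec: np.ndarray) -> str:
    """Placeholder for the slow, LLM-driven generation pipeline."""
    raise NotImplementedError
```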
I don’t rely on OpenSearch/Elasticsearch for getting to the tables/SQL. I use LLMs (plural) to get to that promised 95%. The LLMs (the gpt-4o series) have enormous exposure to SQL-driven use cases.
hth
Thanks, can you elaborate more on which components you have in your solution and how you use LLMs with them?
It would be interesting to know your approach, as 95 percent is great accuracy.
In broad strokes, to expose text2sql to business folks who are, shall we say, pretty finicky, here’s the approach I would take:
- describe the overall schema to the LLMs
- expose the power-user queries to the LLM to generate meaningful text
  - there are some power users who would gladly expose their favorite SQLs
  - there are others for whom there’s always a golden gate
- get feedback on the text from the power users
- the above three guarantee 100% semantically correct SQL and results that are self-fulfilling, solving the seeding issue
  - store those in a vector store
- generate additional use cases through LLMs
  - validate the SQL generated for those use cases through recall of the vector store
  - go to experts with the generated use cases and their validity
  - store the valid examples back into the vector store (a minimal sketch of this seeding/validation loop follows the list)
- then, moving along, target users in waves
  - first, power users who can cross-pollinate data domains
  - primary emphasis on semantic correctness
  - if intent is unclear, caution the user about the feedback and why it may be incorrect
  - incorporate user logs into the intent to create new queries
…
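For illustration, a minimal sketch of the SQL-to-text seeding step above, assuming the openai Python client and a gpt-4o model; the prompts, the `vector_store` object, and the human-approval hook are placeholders, not a definitive implementation.

```python
from openai import OpenAI

client = OpenAI()

def sql_to_text(sql: str, schema_description: str) -> str:
    """Ask the LLM to describe a power user's SQL as a plain business question."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Schema:\n{schema_description}"},
            {"role": "user",
             "content": f"State, as one business question, what this SQL answers:\n{sql}"},
        ],
    )
    return resp.choices[0].message.content

def power_user_approves(text: str, sql: str) -> bool:
    """Placeholder for the feedback step: the power user confirms the description."""
    raise NotImplementedError

def seed_vector_store(power_user_sqls, schema_description: str, vector_store) -> None:
    for sql in power_user_sqls:
        text = sql_to_text(sql, schema_description)
        if power_user_approves(text, sql):        # human-in-the-loop validation
            vector_store.add(text=text, sql=sql)  # store (question text, SQL) pairs
```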
Again, as you can see, this uses LLMs for their native power: going from a business use case to SQL is the important part, and SQL to a business use case is relatively straightforward, given a good description of the schema.
After that is when you can focus on latency (a different beast altogether)
hth
Great…the issue I am facing is how to build a prompt-back / ambiguity-clearing feature. A small example:
User: “How much sales of baby diapers was done?” The ambiguity detector should ask back:
- “Do you need it for the current year? Available years are 2022 to current,” etc.
- Or: “Which diaper size do you need this data for? Available sizes are x, y, z,” etc.
- Or it suggests a similar question from the user’s local favorite-questions set, else from the global vetted question bank.
My end users are not SQL-savvy, so building a correct local favorite-question bank per user is looking tough…any thoughts?
Like I mentioned earlier, the LLMs have a TON of knowledge about data warehouses in general.
One thought is to describe the schema, then provide summary stats for the distribution of different columns in different tables, and then ask the LLMs for some typical queries (in text, not SQL) that a user could have of that data.
Translate that back into semantically correct SQL and generate actual reports to be validated by user-experts (a rough sketch follows). The key is to solve the initial seeding issue and solve it well.
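As a rough sketch of that prompt construction (model choice, wording, and function names are my assumptions, not a definitive recipe):

```python
from openai import OpenAI

client = OpenAI()

def propose_seed_questions(schema_ddl: str, column_stats: str, n: int = 20) -> str:
    """Ask the LLM for typical business questions (text, not SQL) over this data."""
    prompt = (
        f"Schema:\n{schema_ddl}\n\n"
        f"Column distributions / summary stats:\n{column_stats}\n\n"
        f"List {n} typical questions (plain English, no SQL) a business user "
        "might ask of this data."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```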
hth
Can you explain more about storing SQL with personalized recommendations in a vector store for latency…
Also, my issue is how to ask for clarification / run ambiguity checks.
User asks: “best movie in IMDb”
System should prompt back: “Do you mean in the current year, or do you want a year-by-year breakdown?”
Another one:
User: “What’s the most watched movie by Sam?”
Now there can be many Sams (Sam, Samuel, etc.)…how do we tackle this ambiguity by asking back?
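Not claiming this is anyone’s production method, but one way to sketch the ask-back step: have the LLM either return SQL or a clarifying question as structured JSON, feeding it the candidate values (available years, all matching names for “Sam”, etc.) so it can enumerate the options. The model, prompt, and names below are assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

SYSTEM = """You translate questions into SQL for the given schema.
If the question is ambiguous (missing year, an ambiguous entity like 'Sam', etc.),
do NOT guess. Return JSON: {"status": "clarify", "question": "<ask-back>", "options": [...]}.
Otherwise return JSON: {"status": "ok", "sql": "<query>"}."""

def answer_or_clarify(user_question: str, schema: str, known_values: str) -> dict:
    """known_values: e.g. available years, or matching names like 'Sam Smith, Samuel Lee'."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force well-formed JSON back
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user",
             "content": f"Schema:\n{schema}\nKnown values:\n{known_values}\n"
                        f"Question: {user_question}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# For "most watched movie by Sam", known_values could list every matching viewer,
# so the model can ask back: "Did you mean Sam Smith or Samuel Lee?"
```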