Need pointers from experts for building robust text-to-SQL conversion across 20 datasets with 2000+ attributes

Hi experts,
There has been a lot of discussion around text-to-SQL, but none of it has concluded with a robust, reliable, scalable solution. Can you please help define such a design? This would definitely help a lot of folks here.


gpt-4-0125 or gpt-4-1106 should be up to the task if you do it in small chunks :thinking:

what have you tried so far?

I have tried supplying a huge prompt with table and column descriptions plus distinct column values, along with many few-shot examples. There is so much text that it feels like we are trying to write out every possible user query, and it results in a huge token cost. I'm not sure what to do next.

I'm planning to create views on top of all the datasets with meaningful view and column names, then do intent detection and entity-relation extraction before feeding the table definitions, column names, etc., plus the few-shot examples, into RAG. Will that help?

Yeah, identifying tables by intent would be a good start, you can use vectors or keywords for that. I imagine you don’t need to have all tables loaded in context to answer a query.
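To illustrate the keyword variant of that idea, here is a minimal sketch that scores tables by token overlap between the user question and a short table description, so only the top matches go into context. The table names and descriptions are made up for illustration; in a real system an embedding model would usually replace this scoring.

```python
# Minimal keyword-based table selection: score each table by how many
# query tokens appear in its description, keep only the top matches.
# Table names and descriptions here are hypothetical examples.

TABLE_DESCRIPTIONS = {
    "orders": "customer orders with order date status and total amount",
    "customers": "customer names emails and signup dates",
    "products": "product catalog with categories and prices",
}

def select_tables(question: str, top_k: int = 2) -> list[str]:
    tokens = set(question.lower().split())
    scored = [
        (len(tokens & set(desc.lower().split())), name)
        for name, desc in TABLE_DESCRIPTIONS.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

print(select_tables("total amount of orders per customer"))
# -> ['orders', 'customers']
```

Crude as it is, this kind of filter already means you never load all 20 datasets' schemas into the prompt at once.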

And views? Depends. You can have views be an optional feature; if you have a high query match on a view you could return it, but I’m wondering if it’s not better to just scrap views altogether, because they’re just redundant information. Unless you use them for access control, but then you don’t need the fundamental tables.

The trick is to include as little information as possible, while still being sufficient and obvious to answer the query.

My base tables have garbage column names, so I was thinking of creating views with more meaningful names. That would also give me better control later if I want to rename anything: instead of supplying the underlying table names, I'd control the column names through the views.
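As a concrete sketch of that renaming idea (assuming SQLite; the table and column names are hypothetical), a view can expose readable names over a cryptic base table, and only the view is ever shown to the model:

```python
import sqlite3

# Hypothetical base table with cryptic column names, wrapped in a view
# that exposes meaningful names. Only the view is shown to the model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_ord_01 (c1 INTEGER, c2 TEXT, c3 REAL)")
conn.execute("""
    CREATE VIEW orders AS
    SELECT c1 AS order_id, c2 AS customer_name, c3 AS total_amount
    FROM t_ord_01
""")
conn.execute("INSERT INTO t_ord_01 VALUES (1, 'Alice', 99.5)")

# Generated SQL can now target the clean view instead of the base table.
row = conn.execute(
    "SELECT order_id, customer_name, total_amount FROM orders"
).fetchone()
print(row)  # (1, 'Alice', 99.5)
```

Renaming later then only means redefining the view, without touching the base table or retraining anything.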

How would you envision this kind of ask end to end, considering we also need some kind of query checker before running the SQL against the DB?
Please suggest.
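One cheap query-checker sketch, assuming a SQLite mirror of the schema (the table and column names here are made up): compile the generated SQL with `EXPLAIN` against an empty copy of the schema. That catches syntax errors and unknown tables or columns without ever touching production data, and a simple prefix check blocks non-SELECT statements.

```python
import sqlite3

# Validate generated SQL against an empty schema copy: EXPLAIN compiles
# the statement without executing it, so bad SQL fails fast and safely.
schema_conn = sqlite3.connect(":memory:")
schema_conn.execute("CREATE TABLE orders (order_id INTEGER, total REAL)")

def check_sql(sql: str) -> tuple[bool, str]:
    # Reject anything that is not a read-only SELECT.
    if not sql.lstrip().lower().startswith("select"):
        return False, "only SELECT statements are allowed"
    try:
        schema_conn.execute("EXPLAIN " + sql)
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)

print(check_sql("SELECT total FROM orders"))  # valid
print(check_sql("SELECT nope FROM orders"))   # unknown column -> rejected
print(check_sql("DROP TABLE orders"))         # not a SELECT -> rejected
```

If the check fails, you can feed the error message back to the model for a retry instead of surfacing it to the user.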

That’s not a bad idea. I assumed you wanted to throw all your stuff at the AI and hope for the best. Cleaning up your stuff does help.

It really depends on your data. Without looking at it I can’t say anything concrete. I’d just generally go with the standard

embed(table context) → search/retrieval + instruction → sql generation
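That pipeline could be sketched like this. The "embedding" below is a deliberately crude bag-of-words vector over a fixed vocabulary, standing in for a real embedding model, and the generation step is stubbed out; table names and contexts are hypothetical.

```python
import math

# Toy pipeline: embed(table context) -> search/retrieval -> SQL generation.
# Bag-of-words vectors are a stand-in for a real embedding model.
tables = {
    "orders": "order id customer id order date total amount",
    "customers": "customer id name email signup date",
}
vocab = sorted({tok for ctx in tables.values() for tok in ctx.split()})

def embed(text: str) -> list[float]:
    tokens = text.lower().split()
    vec = [float(tokens.count(word)) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

index = {name: embed(ctx) for name, ctx in tables.items()}

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    def score(name: str) -> float:
        return sum(a * b for a, b in zip(q, index[name]))
    return sorted(index, key=score, reverse=True)[:k]

question = "total amount per order"
context = retrieve(question)
prompt = f"Tables: {context}\nQuestion: {question}\nWrite SQL."
# sql = call_your_llm(prompt)  # generation step left to a real LLM API
print(context)  # ['orders']
```

Swap `embed` for a real embedding model and `call_your_llm` for your generation call, and the shape of the system stays the same.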

Any particular embedding model you would suggest?
Also, any vector DB suggestions that would fit well here? Would OpenSearch or Elasticsearch be a good choice?
With many columns, how would you suggest we proceed?
Is this a good flow, or even possible?
1. View creation with clean names
2. Intent detection based on the user query (how can we do this efficiently?)
3. RAG based on table definitions and many one-shot examples (or should the one-shot example queries be kept separate?)
4. Similarity search on user question + intent + system prompt to find the correct entities, i.e. tables and columns
5. SQL generation
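For the intent-detection step, one cheap baseline before reaching for an LLM is keyword-rule matching; the intents and trigger words below are made up for illustration, and a trained classifier or LLM call would replace this in practice:

```python
# Rule-based intent detection baseline. Intents and trigger keywords are
# hypothetical; an LLM or trained classifier replaces this in production.
INTENT_KEYWORDS = {
    "aggregation": {"total", "sum", "average", "count", "max", "min"},
    "lookup": {"show", "list", "find", "get"},
    "trend": {"over", "trend", "monthly", "yearly", "growth"},
}

def detect_intent(question: str) -> str:
    tokens = set(question.lower().split())
    best, best_hits = "lookup", 0  # default intent when nothing matches
    for intent, keywords in INTENT_KEYWORDS.items():
        hits = len(tokens & keywords)
        if hits > best_hits:
            best, best_hits = intent, hits
    return best

print(detect_intent("total count of orders per customer"))  # aggregation
print(detect_intent("list all customers"))                  # lookup
```

Even a rough intent signal like this can be concatenated with the question before similarity search, as in step 4.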

Can you please let me know if this flow seems OK?

Any pointers here will be highly appreciated. Many people are trying to do the same, and it would be great if everyone knew the options.

Hi, sorry.

I wasn’t sure how to answer your question - particularly because it seems like you’re asking us to do your engineering work for you.

In general, sure, sorta, what you’re proposing can work. You’ll need to evaluate that stuff on your actual data and use-cases to see how well it works.

If it’s too much work, consider hiring a consultant? :thinking:

But I’d suggest just getting started and trying to see what works and familiarizing yourself with the ecosystem, I’m sure you can do it! There’s nothing wrong with getting it slightly wrong the first time, it’s just part of the work!

I am sorry if it sounded like that. We have a strategy; I just wanted to validate it, or get ideas from someone who has done a similar use case in the past.
