I am trying to convert natural language into SQL queries. Could you please describe some common problems in the conversion?
It’s actually not that complex of a problem to solve.
If you provide the database columns and a short description of the data stored in each column, plus a sample query or two, at temperature 0 you should be able to get really good accuracy for SQL query generation. The trick lies in deciding exactly what to put in the context and what to put in the prompt.
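As a rough sketch of what that context might look like (the table, column descriptions, and sample pair below are all invented for illustration), the resulting string is what you would send to the completion API at temperature 0:

```python
# Sketch of a NL2SQL prompt: schema, per-column descriptions, and a
# sample question/SQL pair. All names below are hypothetical.
SCHEMA = """Table customers:
  id         INTEGER  -- unique customer id
  name       TEXT     -- customer display name
  lastLogin  TEXT     -- last login activity date (ISO 8601)"""

EXAMPLES = [
    ("List all customer names.", "SELECT name FROM customers;"),
]

def build_prompt(question: str) -> str:
    """Assemble schema context, few-shot examples, and the user question."""
    shots = "\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in EXAMPLES)
    return (
        "Translate the question into a SQLite query.\n\n"
        f"{SCHEMA}\n\n{shots}\n\nQ: {question}\nSQL:"
    )

print(build_prompt("How many customers logged in in the last 30 days?"))
```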
Thanks @udm17 .
But I want to get the same query for multiple types of questions:
1) How many customers have logged in in the last 30 days?
2) How many customers were active in the last 30 days?
In this case, for the above two questions, I want to get the same query.
Can we use embedding to solve this issue?
Can you elaborate a bit more on how you would use the embeddings?
Based on my trials generating code/CLI with GPT, you should not have a problem with the same query being generated for different questions, as long as the semantics and intent of the questions are the same.
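If embeddings were used here, one rough sketch is to cache canonical question/SQL pairs and reuse the SQL whenever a new question embeds close enough to a cached one. The `embed()` below is a toy bag-of-words stand-in for a real embedding model, and the cached pair is invented:

```python
import math

def embed(text: str) -> dict:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    vec = {}
    for w in text.lower().replace("?", "").split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical cache of canonical question -> SQL.
CACHE = {
    "how many customers have logged in in the last 30 days":
        "SELECT COUNT(DISTINCT id) FROM customers "
        "WHERE lastLogin >= date('now', '-30 days');",
}

def lookup(question: str, threshold: float = 0.6):
    """Return cached SQL if the question is close enough, else None."""
    best_q = max(CACHE, key=lambda q: cosine(embed(q), embed(question)))
    if cosine(embed(best_q), embed(question)) >= threshold:
        return CACHE[best_q]
    return None  # fall back to generating fresh SQL

print(lookup("How many customers active in last 30 days?"))
```

This way paraphrases of an already-answered question map onto one vetted query instead of a freshly generated one.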
That is the whole point, I would think, of using natural language to create sql queries. This is basically how you do it: https://platform.openai.com/examples/default-sql-translate
Pretty simple. But what I do, in addition to providing the table layout, is also provide more detailed field descriptions to help the model understand how the fields can be used to answer questions.
For example, if your last login date field is lastLogin, describe it as “last login activity date” or something like that. That way the model will know this field can be used to answer both queries.
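As a quick illustration, here is a toy SQLite table (entirely made up) where a field described as "last login activity date" lets one query answer both phrasings:

```python
import sqlite3

# Toy table; with lastLogin described as "last login activity date",
# both "logged in in the last 30 days" and "active in the last 30 days"
# should map to this single query. All data is invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, lastLogin TEXT)")
con.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "2099-01-01"), (2, "1999-01-01"), (3, "2099-01-02")],
)
sql = """SELECT COUNT(*) FROM customers
         WHERE lastLogin >= date('now', '-30 days')"""
active = con.execute(sql).fetchone()[0]
print(active)  # customers 1 and 3 have recent activity
```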
Natural language is inherently ambiguous. For example, “How many customers have logged in in the last 30 days?” might imply counting logins, but it could also mean counting unique customers.
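To make the ambiguity concrete, here is a toy logins table (hypothetical schema and data) where the two readings give different SQL and different answers:

```python
import sqlite3

# One customer logged in twice, so "count logins" and "count unique
# customers" diverge. Data is invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logins (customer_id INTEGER, login_at TEXT)")
con.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [(1, "2099-01-01"), (1, "2099-01-02"), (2, "2099-01-03")],
)
recent = "login_at >= date('now', '-30 days')"
total_logins = con.execute(
    f"SELECT COUNT(*) FROM logins WHERE {recent}").fetchone()[0]
unique_customers = con.execute(
    f"SELECT COUNT(DISTINCT customer_id) FROM logins WHERE {recent}"
).fetchone()[0]
print(total_logins, unique_customers)  # 3 logins, 2 customers
```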
Not really that ambiguous an example; that's simple NOT negation:
“How many customers have NOT logged in in the last 30 days?”
or get really edgy
“How many NOT customers have logged in in the last 30 days?”
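The negated form really is a different query, by the way: a customer who never logged in has no login row at all, so it takes an anti-join rather than a flipped WHERE clause. A toy sketch (schema invented):

```python
import sqlite3

# Customer 1 has a recent login, customer 2 only an old one, and
# customer 3 has never logged in; the NOT-question must count 2 and 3.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER);
    CREATE TABLE logins (customer_id INTEGER, login_at TEXT);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO logins VALUES (1, '2099-01-01'), (2, '1999-01-01');
""")
sql = """SELECT COUNT(*) FROM customers c
         WHERE NOT EXISTS (
             SELECT 1 FROM logins l
             WHERE l.customer_id = c.id
               AND l.login_at >= date('now', '-30 days'))"""
inactive = con.execute(sql).fetchone()[0]
print(inactive)
```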
You're right about clarity. Next point: I think embeddings are not the solution. They're just a patch at the end of the problem, and they will add more burden to the model. The solution probably lies in the architecture of the model.
a few thoughts…
Are you taking an out-of-the-box LLM and looking for NL2SQL/text2sql? Are you looking for the model to be aware of the specific database/schema/data/etc. that you want to generate SQL for? What is the complexity of the SQL you are looking to generate?
A base model may only give you a certain level of efficacy and accuracy, and it will not be aware of your specific schema, data, verbiage, etc. You may have to fine-tune the model (PEFT: LoRA, QLoRA) to make it more capable on SQL and on the specific datasets you are writing SQL against. This should increase the efficacy of the outputs. In addition, you may need to use prompt engineering and RAG.

Alternatively, full tuning may be an option, but it is an expensive compute operation that requires readjusting the weights of the base model. Fine-tuning is an alternative to full tuning in which only some of the weights in the weight matrix are adjusted, and it is less compute-intensive.

OpenAI released GPTs which are trained to be knowledgeable in a specific area, so a GPT for SQL would be something similar: primarily trained for the purpose of generating SQL. One can go further still, where a GPT is trained on an organization's schema and data and works proficiently to generate SQL for that particular organization.
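As a crude sketch of the RAG side of this: retrieve only the schema snippets relevant to the question before assembling the prompt, so the model sees just the tables it needs. A real system would use embedding search; simple keyword overlap stands in here, and the table snippets are invented:

```python
import re

# Hypothetical per-table schema snippets that would live in a store.
SCHEMA_SNIPPETS = {
    "customers": "customers(id, name, lastLogin)"
                 "  -- lastLogin: last login activity date",
    "orders": "orders(id, customer_id, total, placed_at)",
    "products": "products(id, sku, price)",
}

def retrieve(question: str, k: int = 1):
    """Rank snippets by word overlap with the question; keep the top k."""
    q_words = set(re.findall(r"[a-z]+", question.lower()))
    scored = sorted(
        SCHEMA_SNIPPETS.items(),
        key=lambda kv: len(q_words & set(re.findall(r"[a-z]+", kv[1].lower()))),
        reverse=True,
    )
    return [snippet for _, snippet in scored[:k]]

print(retrieve("How many customers logged in recently?"))
```

The retrieved snippets would then go into the context exactly as described above, keeping the prompt small even for a large schema.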