Creating a Chatbot using the data stored in my huge database

But I have seen everywhere almost, the only thing I have seen embedding is used for is to do similarity searches and then may be some other similar tasks
but I am unable to find anywhere saying how embeddings will store use questions and answer based on the prev question by making the chat bot to a conversational bot

That’s the superpower of embeddings - similarity. You need to allow your mind to embrace this term without the “search” predicate.

Imagine a chat conversation that mentions a data “value” type (aka, a column in a database). But, such mention is not precisely the field name itself - it is an intimation that it is a field of data your system possesses. Instead of assuming that embeddings are limited to identifying data similarities, they can also be used to understand aggregation requirements with natural language conversations. e.g., the user has mentioned an interest in an analytic - what about my schema pertains to this?

This is one example of using embeddings to understand the interest of a human as it pertains to the data you have that the human may need. The popular understanding of how embeddings are employed is narrow and lacks innovation.

Imagine how embeddings might be used to …

  • Determine if an inference is more likely to be a hallucination as opposed to more factual.
  • Classify a query as being about (x) data vs (y) data.
  • Understand when humans are asking about things that have already been asked and which exist in the cache.
  • Reject certain queries to avoid prompt injections.
  • Continually reward a chained process for adjusting its compass heading as it seeks to complete a goal.

I’ve often intimated that AI solutions are nothing more than that which we have been doing for decades in computer science; embeddings are simply another decisive advance in the quest to make better applications. It’s your opportunity to be creative with that advantage.


I have found a solution actually But not sure if this works or not
What I wish to do is I will store the user previous questions in a vector db after embedding along with the response to those previous questions
Now when a user asks questions I will see the most related questions and ask GPT to reframe the user question using his previous questions so that it includes all the info in the current question
Now as usual I will send this to GPT to write a query

what do you all think does this work ?
anyone tried doing this way?

@abhi3hack This new feature from OpenAI might be helpful to you : OpenAI Platform – notice the part about sql_query – that might be helpful to you.

1 Like

Preach, Brother! Ain’t it the truth. I mean, if I just wanted a simple question in answer out chatbot, I would have been through months ago. It’s the what-ifs that get you bogged down. What if the user asks the question this way instead of that way? What if their question is more of a keyword search than a semantic search? What options can you give, besides “Sorry!”, if the search is not successful? I mean, it goes on and on and on…

I only skimmed through this thread, but from what I can see, the initial poster has skipped a lot of steps in the process. I mean, the learning process. I’ve seen a lot of people throwing out a lot of good suggestions and ideas, but it’s like making suggestions in Arabic when one only knows French.

So, I would suggest to @ abhi3hack to understand what an embedding is, and the difference between embedding and fine-tuning:

If you understand and decide embeddings are the way to go, take a look at this flowchart and make the time to understand what each step of the process is, and how it is accomplished.

And, if you don’t understand the flowchart, watch this tutorial, which steps through the same process:

Then, when you truly understand this entire “chatbot” “chat with your data” process, come back and ask: So, how can I do this without uploading my private data to a remote vector store. The answers you get back will make a lot more sense.

I’m no expert at this. I’ve got tons more to learn. I’m just sharing what I did to get to the point where I could competently code chat applications from point A to Z.


I’m not entirely sure if I understand your question, but it appears to be related to the difference between a keyword search (sql) and semantic search (AI).

They are two different animals, designed for two different results. I can tell you that trying to use AI (gpt-4, gpt3.5, etc…) to search for items by keyword is doomed to failure. These systems are designed to look for ideas, concepts, summarizations, interpretations which come as close to your query as they can get. They are not designed to locate information based upon specific words in a document.

In fact, I just recently posted an issue here where I had this exact same problem: How to format context documents to allow model to recognize specific fields within documents

So, if your video tags are titles, locations, authors, you know, specific information, you may be better off going with SQL. If your video tags, on the other hand, are free flowing descriptions, critiques, summarizations, then AI is the tool for the job.

Hope this helps.

1 Like

I have created a workflow and in the phase1 I want to generate accurate sql queries from natural language I have tried the embeddings approach

It has taken a long time for me to write those sql queries for usecases becuase My single sql query is of 550 tokens Because I need to include user_ID, multiple table joins, My query becomes complicated
So now I want to know with the experience of using Embeddings in some way

Is there any other better approach Because this is clearly not helping me, yes to an extent but my accuracy is only 60%
Provided I have asked similar questions which I am sending in the prompt but still it is making errors in writing query
Especially Syntax errors of ‘(’ as there are many brackets in my query because of some weird calculations

Can using langchain help me in building the accuracy ?
Please I need help in this…
@curt.kennedy @bill.french @SomebodySysop @EricGT @alden @nelson @jethro.adeniran

I have seen people doing just by giving table and column info I am still unable to understand how they are able to do it which I am unable to do

All I wish to do currently is just to make chatbot which can do natural language to sql by giving on not giving the database info
If giving database info fetches me good results I am okay with that also

So I need suggestions on how this is done please have a look at my flowchart and give inputs to build this bot

Do you need completeness in the data you want to process ?
For example a recommendation engine doesn’t necessarily need every single tv to suggest you a tv

Whereas if you need the revenue you made from tv you need all of them. I think your usecase will be interesting to understand. Then we can recommend an approach

Hey @heiko
Thanks for responding
let me make my use case clear then

I have a db which consists of my users performance history ( performance means basically all the actions they have made in the website like if they create a purchase order or if there are invoices created and all )

So if a user asks to give me my invoice details latest one or if he asks by any specific date on the chatbot I should be able to retrieve the data and analyze, and answer accordingly
Till now what I thought is to write a sql query to fetch the data since our db is a postgresql, so now making natural language to sql accurately became my priority

I think this should explain you what I am trying to build
One simple example would be
I want to build something likes this but this is very advanced currently I want to build a basic model later we will scale to that level

So now please provide some suggestions in achieving this…

Perfect. So you need a deterministic data retriever with a fuzzy NLU frontend. Right ?
I used the functions api in a similar case for that.
Create a functions that encapsulates your sql and returns the data you need.
You can even „force“ Gpt to go into a question/ answer mode to get the info from your customer that you need to call that function. If you gathered all the necessary info during the dialog the api will return you the parametrized function call
Does that help ?

1 Like

Yes exactly

You were referring to the below article right

I have a question in using this
How many functions I have to create like this writing so many functions for every use case gonna increase my token size right??

Just to add we have an analytics page where we show our user history in charts graphs etc.
For that we have written a single function depending on the filters user adds in the front end( example date of creation, item names etc) we modify the graph accordingly
But this function has parameters which are in json format which have the keys of table_column names now from user question how can this be extracted :confused:
on the top of it this function has only analytical data not all the user information, for all the information will it be possible to write a single function to fetch any type of info from db??
or else if we write multiple functions for every use case then what about token size?

Even if its done some how
But how to make this gpt conversational
Like it should ask for some specific information only right which are required for that function?

How can this be done can you share some resource or github repo or some medium article or something please…

I wanted to get back to you on this, but I see others have chimed in. I was also going to suggest looking at the “Function calling” API feature. You may also want to direct your queries to those who have successfully created links between OpenAI models and their internal SQL databases.

I also wanted to mention that Weaviate (a vector database company) has what it calls a “hybrid” system, where your queries can be a mix of semantic and keyword. This might help also.

But it does sound like being able to use the “Function” feature to generate the correct sql from semantic queries might just do the trick.

Good luck!

1 Like

Great suggestion, there are also some open source frameworks like langchain/llama-index that offer support for SQL/embedding etc. handling that will probably make everything even smoother.

1 Like

I have a doubt I am able to code without using langchain becuase at the end we need to just make a proper prompt and send to the openAI right
So what else langchain/llama-index offers me which I cannot do by myself

One more issue is that we cannot ask openAI to do NL-SQL just by giving our db schema becuase the db as mentioned in the topic title its very huge with multiple tables with columns having jsonb format fields and to answer a single question 2 to 3 tables are being joined
Now in this case do you really think langchain/llama-index helps me
Then can you please share some resource for me to learn
I did my re search in this week but could not find any effective solution, I hope you can help me

If you want to see how we are currently trying to do this I have explained it here

Please do suggest a better approach

We are currently interacting with OpenAI directly through API requests
I did not think of directing our queries
But I don’t understand why will I direct my query if I have generated it I will just execute it and see the results right?
How can functions be used to convert NL to SQL I have added my questions in using it can you please look at them once

Can you please say how to approach this I have a doubt in using functions???

Outside of my sphere of knowledge. My use case does not require the use of functions, so I’ve not looked into them closely. But, there are plenty of individuals on this forum who have. You may want to post a new question focused on just how to use functions in your use case.

Thanks for sharing, I read through the post and benefit a lot from your posts.
I am building a chatbot system for database recently, and I faced the similar issues as OP did, especially for the “Sodium hydroxide (NaOH) & Sodium chloride (NaCl)” problem. Not only do I have a huge database, but also I get multiple tables with similar table names and column names as well.
Embedding is not working to get the precise table. It seems I need to handcraft a lot of guidelines in the prompt, which are not clear since there are various of cases cannot be listed thoroughly. Wondering how should I proceed further?

@abhi3hack I am facing the exact same issue. It turns out to be that function calling using gpt-3.5-turbo-0613/gpt-4-0613 is the only nearest solution right now to this problem or in general for databases.

    messages=[{"role": "user", "content": f"{context}"}],
function_call={"name": "xxx"},

Something like this is definitely going to help you. Only difference in our cases is that I don’t have multiple tables. I’m using single high dimensional table.

Please let me know if you have already done it or if you have a better solution.

Here is a “blank” bot (and the app is on the play store for free real AI blue alien head) convert the files to pdf or txt and feed it into the brain… Do it in small chunks … this is t my app so all respects and rights to the creator… GitHub - 0m364/parrotAI: Fork of real ai android app .. will be attempting some one shot training weights