Creating a Chatbot using the data stored in my huge database

I have created a workflow, and in phase 1 I want to generate accurate SQL queries from natural language. So far I have tried the embeddings approach.

It has taken me a long time to write those SQL queries for the use cases because a single query is around 550 tokens: I need to include user_ID and multiple table joins, so the query becomes complicated.
So now, based on your experience with embeddings, I want to know:

Is there a better approach? Embeddings help to an extent, but my accuracy is only about 60%,
even though I include similar example questions in the prompt. It still makes errors when writing the query,
especially syntax errors around ‘(’ because my queries contain many brackets due to some unusual calculations.

Can using LangChain help me improve the accuracy?
Please, I need help with this…
@curt.kennedy @bill.french @SomebodySysop @EricGT @alden @nelson @jethro.adeniran

I have seen people do this just by giving the table and column info. I still don’t understand how they manage to do what I cannot.

All I want to do right now is build a chatbot that can convert natural language to SQL, whether or not I give it the database info.
If giving it the database info gets me good results, I am okay with that too.

So I need suggestions on how this is done. Please have a look at my flowchart and give your input on building this bot.

Do you need completeness in the data you want to process?
For example, a recommendation engine doesn’t necessarily need every single TV in order to suggest you a TV.

Whereas if you need the revenue you made from TVs, you need all of them. I think your use case will be interesting to understand; then we can recommend an approach.

Hey @heiko
Thanks for responding
Let me make my use case clear, then.

I have a DB which contains my users’ performance history (performance basically means all the actions they have made on the website, such as creating a purchase order or creating invoices).

So if a user asks the chatbot for their latest invoice details, or asks by a specific date, I should be able to retrieve the data, analyze it, and answer accordingly.
So far my plan has been to write a SQL query to fetch the data, since our DB is PostgreSQL, so generating accurate SQL from natural language has become my priority.

I think this should explain what I am trying to build.
One simple example would be https://chatspot.ai/
I want to build something like this. That product is very advanced; for now I want to build a basic version and scale to that level later.

So please provide some suggestions for achieving this…

Perfect. So you need a deterministic data retriever with a fuzzy NLU frontend, right?
I used the functions API in a similar case for that.
Create a function that encapsulates your SQL and returns the data you need.
You can even “force” GPT into a question/answer mode to gather the info from your customer that you need to call that function. Once you have gathered all the necessary info during the dialog, the API will return the parametrized function call.
Does that help?
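A minimal sketch of how that could look with the functions API (the function name get_invoices and its parameters are illustrative placeholders for your own SQL, not something from this thread):

import json
import openai

# Hypothetical function spec; GPT fills the arguments from the conversation.
functions = [
    {
        "name": "get_invoices",
        "description": "Fetch invoice records for a user, optionally filtered by date range.",
        "parameters": {
            "type": "object",
            "properties": {
                "user_id": {"type": "integer", "description": "ID of the user asking the question"},
                "start_date": {"type": "string", "description": "ISO 8601 start date, e.g. 2023-08-01"},
                "end_date": {"type": "string", "description": "ISO 8601 end date"},
            },
            "required": ["user_id"],
        },
    }
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Show me my invoices from last week"}],
    functions=functions,
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model has gathered enough info and returns the parametrized call.
    args = json.loads(message["function_call"]["arguments"])
    # Run your own fixed, parameterized SQL here with these arguments.
    print(args)
else:
    # The model is still asking the user for missing information (e.g. the user_id).
    print(message["content"])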


Yes exactly

You were referring to the article below, right?

I have a question about using this.
How many functions do I have to create like this? Writing separate functions for every use case is going to increase my token count, right?

Just to add: we have an analytics page where we show user history in charts, graphs, etc.
For that we have written a single function; depending on the filters the user adds in the front end (for example date of creation, item names, etc.) we modify the graph accordingly.
But this function takes parameters in JSON format whose keys are table column names. How can these be extracted from the user’s question? :confused:
On top of that, this function only covers analytical data, not all the user information. Would it be possible to write a single function that can fetch any type of info from the DB?
Or, if we write multiple functions for every use case, what about the token count?

Even if it’s done somehow,
how do I make this GPT conversational?
It should only ask for the specific information required by that function, right?

How can this be done? Can you share a resource, GitHub repo, Medium article, or something, please…

I wanted to get back to you on this, but I see others have chimed in. I was also going to suggest looking at the “Function calling” API feature. You may also want to direct your queries to those who have successfully created links between OpenAI models and their internal SQL databases.

I also wanted to mention that Weaviate (a vector database company) has what it calls a “hybrid” system, where your queries can be a mix of semantic and keyword. This might help also.

But it does sound like being able to use the “Function” feature to generate the correct sql from semantic queries might just do the trick.

Good luck!


Great suggestion. There are also some open-source frameworks like langchain/llama-index that offer support for SQL, embeddings, etc., which will probably make everything even smoother.


I have a doubt: I am able to code this without using LangChain, because in the end we just need to build a proper prompt and send it to OpenAI, right?
So what else do langchain/llama-index offer that I cannot do by myself?

One more issue is that we cannot ask OpenAI to do NL-to-SQL just by giving it our DB schema because, as mentioned in the topic title, the DB is very large, with multiple tables, columns containing jsonb fields, and 2 to 3 tables joined to answer a single question.
In this case, do you really think langchain/llama-index will help me?
If so, can you please share some resources for me to learn from?
I did my research this week but could not find any effective solution. I hope you can help me.

If you want to see how we are currently trying to do this, I have explained it here.

Please do suggest a better approach

We are currently interacting with OpenAI directly through API requests
I had not thought of directing our queries.
But I don’t understand: why would I direct my query? If I have generated it, I will just execute it and see the results, right?
How can functions be used to convert NL to SQL? I have added my questions about using them; can you please take a look?

Can you please explain how to approach this? I have a doubt about using functions.

Outside of my sphere of knowledge. My use case does not require the use of functions, so I’ve not looked into them closely. But, there are plenty of individuals on this forum who have. You may want to post a new question focused on just how to use functions in your use case.

Thanks for sharing. I read through the post and benefited a lot from your posts.
I have been building a chatbot system for a database recently, and I faced similar issues to the OP, especially the “Sodium hydroxide (NaOH) & Sodium chloride (NaCl)” problem. Not only do I have a huge database, but I also have multiple tables with similar table names and column names.
Embeddings are not enough to pick the precise table. It seems I need to handcraft a lot of guidelines in the prompt, which isn’t workable since the cases are too varied to be listed exhaustively. Wondering how I should proceed?

@abhi3hack I am facing the exact same issue. It turns out that function calling using gpt-3.5-turbo-0613/gpt-4-0613 is currently the closest thing to a solution for this problem, or for databases in general.

import openai

# `context` is the user prompt and `function` is your function spec list, defined elsewhere.
completion = openai.ChatCompletion.create(
    model="gpt-4-0613",
    messages=[{"role": "user", "content": f"{context}"}],
    functions=function,
    function_call={"name": "xxx"},
)
print(completion)

Something like this is definitely going to help you. The only difference between our cases is that I don’t have multiple tables; I’m using a single, high-dimensional table.

Please let me know if you have already done it or if you have a better solution.
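One way to use the forced function_call result (a sketch under the assumption that `function` holds a single spec like the get_invoices example earlier in this thread; the table and column names below are invented for illustration):

import json
import psycopg2

# `completion` comes from the snippet above; forcing function_call guarantees
# the message contains a function_call with JSON-encoded arguments.
args = json.loads(completion["choices"][0]["message"]["function_call"]["arguments"])

# The model never writes SQL itself; the query stays fixed and parameterized,
# which sidesteps the bracket/syntax errors mentioned earlier in the thread.
sql = """
    SELECT *
    FROM invoices
    WHERE user_id = %(user_id)s
      AND (%(start_date)s IS NULL OR created_at >= %(start_date)s::timestamptz)
      AND (%(end_date)s IS NULL OR created_at < %(end_date)s::timestamptz)
"""

conn = psycopg2.connect("dbname=app user=app")  # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute(sql, {"user_id": args["user_id"],
                      "start_date": args.get("start_date"),
                      "end_date": args.get("end_date")})
    rows = cur.fetchall()
print(rows)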

Here is a “blank” bot (the app is on the Play Store for free: real AI, blue alien head). Convert the files to PDF or TXT and feed them into the brain… Do it in small chunks… This isn’t my app, so all respect and rights go to the creator… GitHub - 0m364/parrotAI: Fork of real ai android app .. will be attempting some one shot training weights

Hello @abhi3hack – how’s it going with this project? Did you have any luck with this? (Did you use the pandasai approach to get this done?)

I finally got around to doing Text to SQL, and it works! However, how well it will work with a multitude of tables using scientific notations, I can’t say. You’d have to test it out.

Follow this basic example here: https://platform.openai.com/examples/default-sql-translate

Pretty simple to set up. You can even do it in the playground (to see if it will generate the correct SQL statements). Try it with a couple of tables and see how well it works.

I’m not using functions, but I have started using Text to SQL: https://platform.openai.com/examples/default-sql-translate

If you haven’t already done so, try duplicating the example with a couple of your tables. Don’t use CREATE statements; just give it your table layout with the actual field names and descriptions.

Something like this:


solrai_log table

logId: int unsigned, 10
 title: Log ID
 description: The unique integer that identifies this log record.
logUser: varchar, 25
 title: Username
 description: The user account associated with this log record.  "system" means this record was created by a system process.
logIp: varchar, 25
 title: IP
 description: The IP address of the person or entity generating this log record.  Generally not applicable to "system" processes.
logDate: varchar, 20
 title: Date
 description: Date and time log record is created.  This is the date format: 1. **ISO 8601 Format**: `2023-08-28T00:00:00`
logType: varchar, 25
 title: Type
 description: Type of log record.
  Available types:
    query - query to model that normally contains the question asked and the response returned by model
    update - update of embedded (vectorized) object
    insert - insert of new embedded (vectorized) object
    ignored - process of embedded object ignored
    rank - log record ranked and ranking inserted into logRank field
    categorization - category selected for logSubject field
logSolrId: varchar, 50
 title: SolrID
 description: Solr Id of this log record, if it is an embedded object
logNid: int unsigned, 10
 title: NID
 description: The node ID that the embedded object of this record is associated with.
logModel: varchar, 40
 title: Model
 description: The LLM (large language model) used for the processing of this log record.  If it involves an embedded object and no calls are made to an LLM, then "Solr" is used as the model name as the record involves processing with the Weaviate "Solr" classname.
logTokens: int unsigned, 10
 title: Tokens
 description: Tokens used in the processing of this log record.
logDesc: longtext, 2147483647
 title: Description
 description: The description of processing executed for this log record.  If it is a query, this will contain the Question asked by the user and the Response received.  If it is an embedding process, it will contain details about the embedded object such as title, dsid, timestamp, etc...
logNotes: varchar, 2500
 title: Notes
 description: Notes associated with this processing.
logTime: decimal, 10
 title: Elapsed Time
 description: The elapsed time required to complete this process.
logEmail: varchar, 80
 title: Email
 description: Email address of user associated with this record.
logSource: varchar, 10
 title: Source
 description: This is the user source of the process, usually a "query" process.
  Known sources so far:
    docs - documentation query
    site - query executed on site
    email - query executed via email
    sms - query executed via sms
logRank: int unsigned, 10
 title: Rank
 description: Ranking of query, integer between 0 and 10.  Null means it has not yet been ranked.
logUrls: text, 65535
 title: URLs
 description: Full HTML URLs returned on query.
logSubject: varchar, 100
 title: Category
 description: Category title assigned to this query record.

It’s really easy. You can test this in the playground to see if it comes up with workable SQL. If so, then you just need to come up with a system message that defines your table(s) and start firing questions at the API.
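As a rough illustration of the above (the model name, file name, and question here are my own assumptions, not from the post), the whole thing can be a single chat completion with the table layout in the system message:

import openai

# The field list above, saved to a text file (the file name here is just an example).
table_layout = open("solrai_log_layout.txt").read()

system_message = (
    "Given the following table layout, write a syntactically correct SQL query "
    "that answers the user's question. Return only the SQL.\n\n" + table_layout
)

response = openai.ChatCompletion.create(
    model="gpt-4-0613",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": "How many query records were ranked 8 or higher in August 2023?"},
    ],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])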

Hope this helps.

Hello @abhi3hack ,

I see you’re working on a custom chatbot for your database and have encountered some challenges with prompt size and data privacy. Here are some brief suggestions:

  1. Prompt Size: Consider breaking down complex interactions into smaller steps to manage prompt size more effectively.
  2. Privacy: Explore privacy-friendly techniques like federated learning or differential privacy to protect user data during model training.
  3. Expertise: Collaborate with AI experts or data scientists for guidance in designing an effective chatbot architecture tailored to your organization’s needs.

Building a database-interacting chatbot can be complex, but with the right approach, you can address these challenges and create a valuable solution. Feel free to ask for more specific help if needed!

Hi,

I am also facing the same problem when working with a big PostgreSQL database that has more than 200 tables and a data size of around 4-5 GB.

@bill.french @abhi3hack can you please check whether the approach I have written below is correct? I have been struggling with this problem for a long time and, despite trying multiple solutions, have not been able to solve the use case.

1. Generate a document of all table schemas as follows:
	a. For each table:
		i. Table name
		ii. CREATE statement DDL
		iii. Data types and explanations, generated using OpenAI
		iv. Example queries in dict format:
			1) {"Problem Statement/Use Case": "Query"}
		v. Explanation of the table's primary key and foreign keys
2. Embed the document into a vector store.
3. Once the user asks a question, first search the vector store for the tables and columns required to answer it.
4. Pass these tables, along with a custom prompt and the user question, to the LLM to generate the SQL query.
5. Execute the SQL query on the database (a rough sketch of steps 4-5 is below):
	a. If error:
		i. Max retries is 5; if retries go beyond 5, return an error output.
		ii. Ask the LLM to correct the query by providing the SQL error. This can be done with the SQL agent in LangChain.
	b. If success:
		i. Send the data to the LLM to respond to the user in natural language.
		ii. Pass the data returned from SQL (rows and columns, i.e. table data) to the tokenizer and check whether it exceeds the model's token limit.
			1) If the token limit is exceeded, trim the data to fit, send it in chunks, and curate the results in a final step.
			2) If the output data is under the token limit, pass it directly to the LLM to get the NL output.
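Steps 4-5 of that outline could look roughly like this (a sketch only; the function names, prompt wording, and model choice are my assumptions, and the token-limit chunking from step 5.b.ii is left out):

import openai
import psycopg2

MAX_RETRIES = 5

def generate_sql(question, table_docs, error=None):
    # table_docs is the schema text retrieved from the vector store in step 3.
    prompt = (f"Schema:\n{table_docs}\n\nQuestion: {question}\n"
              "Return only a PostgreSQL query.")
    if error:
        prompt += f"\nThe previous query failed with this error, please fix it:\n{error}"
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"].strip()

def answer(question, table_docs, conn):
    error = None
    for _ in range(MAX_RETRIES):
        sql = generate_sql(question, table_docs, error)
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                rows = cur.fetchall()
        except psycopg2.Error as exc:
            conn.rollback()
            error = str(exc)  # feed the DB error back to the model and retry
            continue
        # Step 5.b: summarize the rows in natural language.
        summary = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            messages=[{"role": "user", "content":
                       f"Question: {question}\nSQL result: {rows}\nAnswer in plain English."}],
        )
        return summary["choices"][0]["message"]["content"]
    return "Sorry, I could not generate a working query after several attempts."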

Hi @rushabh.sudame and welcome to the community!

There’s not enough information to assess your approach. Are you using LangChain? If you aren’t, I recommend you do, because it will help you understand whether you are doing this right. Any deeper inspection would require putting you on a consulting clock, and you don’t want that. :wink: