Creating a Chatbot using the data stored in my huge database

I am actually doing something similar myself, but because I’ve only a couple of tables, I am able to send them in the system message.

Your approach is sound. I believe, beyond the schema, you will need to describe the overall functionally of each table as well as the specific functionality of each field. And, since you are using embeddings, I would be as detailed about this as possible. Why? Because it’s the vector store that will need to retrieve the correct tables based upon the user’s request. The better your table descriptions, the more likely it will retrieve the correct table schemas. The LLM creating the SQL is the easy part – if it has the requisite table info.

Also, I would keep tables that you expect to be used together frequently as close together in your embeddings as you can (same document). That will help things along as well.


Thank you for your reply. Yes consulting clock wont be affordable for me as of now :stuck_out_tongue_closed_eyes:

But can you please help me validate the approach? I am using llama index for this. If langchain has better functionality, then I dont mind using langchain.

Thank you so much, If possible can you please give me some insights on the solution you are using?

I followed the example here:

So, in each call to the model, I send a system message like this:

Given the following SQL tables, your job is to write an SQL query given the user’s request. You may not write queries that contain CREATE, INSERT, UPDATE, DROP, DELETE, ALTER, RENAME, TRUNCATE or any command that alters the database structure or data in any way. Only return one SQL statement. No comments. There are only two tables: solrai_log uses the logUser field and solrai_user uses the username field. Only these two tables ae to be used. Also remember that a ‘query’ is a logType. So, the term ‘queries’ always refers to logType. Use the field titles for headers instead of the field names, but use the field names for the commands.

solrai_log table

logId: int unsigned, 10
title: Log ID
description: The unique integer that identifies this log record.
logUser: varchar, 25
title: Username
description: The user account associated with this log record. “system” means this record was created by a system process.
logIp: varchar, 25
title: IP
description: The IP address of the person or entity generating this log record. Generally not applicable to “system” processes.

Only return an SQL statement. No comments. Generate response ONLY in SQL. Return no text other than the executable SQL.

And the user input.

Simple, but works surprisingly well.

“Select into” … might be a thing.

You should never dissallow things. Only allow.

Totally not true. Well, in my world, totally not true. Can’t speak for anyone else.

These calls are only made through a read only connection. I also did not mention that I have a second system message that applies after the SQL statement is created. Just wanted to give a simple example here.

Extra safety never hurt.
Then a prompt “ONLY allow SELECT statements to read data. The database has only read permissions” might work more accurately.

I built a Bot using GPT3.5 for the same purpose. For small number of tables, it provided very high accuracy. In your case, I would work on the data side. If your data is from OLTP, ETL the data to a DWH that’s more suitable for data analysis. Build data marts for business function areas. The Bot checks the data marts using business logics in the back end rather than the entire tables. This approach reduces your number of tables dramatically. To further cut down the number of tables, use views to join some tables in a data mart and have the Bot query views.

1 Like

Welcome to the developer community.
Great first post!
Data(base) experts are very welcome!

Hi @nelson , can i use Embedding on the SQL dataset ? is there any link to which works on sql server management studio to create a chatbot with the db, can you share details on that?

Summary created by AI.

The discussion in the thread revolves around the concept of building a custom chatbot that can answer queries based on data available in a database. Abhi3hack initiated the discussion by expressing the issues faced while creating such a system, predominantly using OpenAI GPT-3.5 and a PostgreSQL database. He shared two trial methods, one using langchain and one without. Yet, encounters problems with token size in the case of langchain. He also expressed concerns about fine tuning and embeddings, unfamiliar with how embeddings work and worried about user privacy due to the potential requirement to provide all database information.

Muindemwanzia919 shared a similar problem with a goal of creating a chatbot based on data stored in Google Drive. In response to this, Abhi3hack asked about the approach muindemwanzia919 was planning to use.

EricGT pointed Abhi3hack to a video about building systems with the ChatGPT API, and Abhi3hack found it relevant to his needs.

Muindemwanzia919 shared their code, which involved accessing a Google Drive and answering queries based on documents stored in it, and asked for suggestions on how it could be improved.

Bil.french acknowledged abhi3hack’s query regarding privacy concerns with fine tuning or embedding and proposed checking an open-source tool called pgvector. He emphasized understanding the project before deciding the approach.

Abhi3hack clarified that the goal is to build a chatbot that can answer user queries based on the analytics data in the database, similar to what Hubspot has accomplished.

Nelson provided three potential solutions to overcome the limitations of the Large Language Model (LLM). They included managing the input length, using an embedding data storage system, and fine-tuning the model with custom data. He also mentioned his project, Superinsight, which could be used to create a chatbot.

Abhi3hack was unclear about Nelson’s suggestions and asked further questions regarding how to effectively reduce token size, the functionality of embeddings with numerical data, and user’s privacy concern with fine tuning.

Following more discussions about best methods of data presentation and privacy SomebodySysop emphasized the necessity for abhi3hack to understand embedding and the difference between embedding and fine-tuning, recommending several resources. They added that abhi3hack would be able to understand the answers to their questions better once they understood these concepts.

In his later posts, SomebodySysop provided insights on the difference between a keyword search (sql) and semantic search (AI) and suggested their suitability under different requirements, clearly identifying the advanced capabilities and limitations of AI.

Summarized with AI on Nov 24 2023
AI used: gpt-4-32k