Seeking Guidance on Building a ChatGPT-Style Data Analyst Tool with Database Integration

Hey everyone,

I’m currently diving into a project aimed at crafting a tool akin to a data analyst, inspired by ChatGPT’s capabilities. The twist is that instead of relying on uploaded CSV files, this tool would connect directly to databases, allowing users to ask complex questions about:

  • Aggregate statistics (like total revenue per account, average deal size)
  • Time-based insights (such as the age of an account or the time since the last transaction)
  • Categorical data interpretations for non-numeric fields (industry type, account region, etc.)
  • Custom calculations or specific business logic for deeper analysis

I started experimenting with an approach where GPT generates PostgreSQL queries, which are executed through the pg npm module (a simplified sketch of the flow is below). Initially it seemed promising, but I’ve hit a couple of snags:

  • The challenge of handling databases with 40-50 tables, each containing upwards of 20 fields. Detailing the schema in the prompt quickly becomes impractical due to size constraints.
  • The token limit makes it difficult to employ embeddings effectively for this purpose.
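For context, here is a simplified sketch of the current flow (shown in Python with psycopg2 and the openai package for brevity; the real prototype does the same thing in Node with pg):

```python
# Simplified sketch of the prototype: prompt GPT with a hand-picked schema
# subset, get SQL back, and run it against Postgres.
# Assumes the openai (v1-style) and psycopg2 packages are installed.
import psycopg2
from openai import OpenAI

client = OpenAI()

# Only a hand-picked subset of the schema fits in the prompt -- this is
# exactly the part that breaks down at 40-50 tables.
SCHEMA_SNIPPET = """
CREATE TABLE accounts (id serial PRIMARY KEY, name text, industry text,
                       region text, created_at timestamptz);
CREATE TABLE transactions (id serial PRIMARY KEY,
                           account_id int REFERENCES accounts(id),
                           amount numeric, occurred_at timestamptz);
"""

def question_to_sql(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Write a single PostgreSQL query. Use only this schema:\n"
                        + SCHEMA_SNIPPET},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

def run_query(sql: str):
    with psycopg2.connect("dbname=crm user=app") as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()

print(run_query(question_to_sql("What is the total revenue per account?")))
```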

I’m reaching out to see if anyone could offer guidance or point me towards open-source projects that tackle similar challenges. Any help or suggestions would be greatly appreciated!


Interacting with a SQL database using an LLM (i.e., “chat with your SQL data”) has become a common application, and people are doing it in a few different ways. Below are two examples that might be helpful as architectural guides for implementing your system.

LangChain Cookbook - Retrieval with SQL

Vanna - RAG (Retrieval-Augmented Generation) framework for SQL generation and related functionality
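I haven’t used Vanna in depth, but from its quickstart the basic pattern looks roughly like this (the API evolves, so treat the names as illustrative and check their docs):

```python
# Rough sketch of the Vanna flow: train on schema + example queries, then ask.
# API names are taken from Vanna's quickstart and may have changed.
from vanna.remote import VannaDefault

vn = VannaDefault(model="my-model", api_key="...")
vn.connect_to_postgres(host="localhost", dbname="crm", user="app",
                       password="...", port=5432)

# Train on DDL for each table, plus example question/SQL pairs -- including
# multi-table joins, which is how it picks up relationships between tables.
vn.train(ddl="CREATE TABLE accounts (id int, name text, industry text)")
vn.train(ddl="CREATE TABLE transactions (id int, account_id int, amount numeric)")
vn.train(sql="SELECT a.name, SUM(t.amount) FROM accounts a "
             "JOIN transactions t ON t.account_id = a.id GROUP BY a.name")

print(vn.ask("What is the total revenue per account?"))
```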

Thank you for your earlier help; it was really useful. I have one more thing to ask. Have you used Vanna AI?

I tried it by training it on my database as the guide described, but the guide doesn’t really explain what to do when the database has many tables and relationships.

Can you help me with this? It works fine when I’m dealing with just one table, but what about when I need to build more complex queries that involve more than one table? Do you know an easy way to do this?

Glad to hear that the recommendations were useful. I’m not an expert in Vanna, but that project has a community forum just like this one that you can use to ask questions.

Hi @kanzariyamihir,

I successfully connected a Postgres database to ChatGPT using custom actions. ChatGPT generates the SQL queries to answer my questions. The SQL queries were sent to a Python notebook hosted online in Mercury Cloud. The notebook runs the SQL query against the database, fetches the results, and returns them as a Markdown table for ChatGPT. My full Python notebook is described in the write-up linked below.
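Simplified, with the Mercury-specific plumbing omitted, the core of the notebook looks something like this:

```python
# Simplified core of the notebook: run the incoming SQL and return Markdown.
# Assumes psycopg2 and pandas (pandas needs tabulate for to_markdown()).
import pandas as pd
import psycopg2

def run_sql_as_markdown(sql: str) -> str:
    conn = psycopg2.connect(
        host="...", dbname="...", user="...", password="..."  # your credentials
    )
    try:
        df = pd.read_sql_query(sql, conn)
    finally:
        conn.close()
    return df.to_markdown(index=False)

# ChatGPT's custom action POSTs the generated SQL here; the Markdown table
# goes back in the response for ChatGPT to interpret.
print(run_sql_as_markdown("SELECT industry, COUNT(*) FROM accounts GROUP BY industry"))
```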

Below is an example prompt and the request generated by ChatGPT:

It was a really nice experience to chat with my database 🙂 You can find a detailed description of my steps in ChatGPT builder create Action to query database

I have created a few GPTs that connect to databases like AITable and Airtable.

I use the Data Analyst tool quite a bit through the GPT interface, and it can do amazing stuff. I am guessing you will be calling the GPT APIs from some programming shell.

I have seen this done on a consulting project using the following approach, and it seemed to work well. The architecture was Django + Python + SQL Server, but the approach should work for any stack.

Step 1: Send the schema for the tables you want to use to GPT and ask it to read and understand the schema. We did that by running a SQL query that generates the schema on the fly and sending the output to GPT.

Step 2: Send the question you want answered and ask GPT to use the schema to provide the SQL that would generate the answer.

Step 3: Run the SQL and send the result set back to GPT along with the user question again, and ask for an answer to the question.

Step 4: Show the answer back to the user.
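If it helps, here is a rough sketch of that loop in Python against the OpenAI API (the names fetch_schema and answer_question are just for illustration; our project wired the same steps through Django, with SQL Server reached via pyodbc):

```python
# Sketch of the four-step loop. Assumes the openai and pyodbc packages;
# fetch_schema/answer_question are illustrative names, not a library API.
import pyodbc
from openai import OpenAI

client = OpenAI()
conn = pyodbc.connect("DSN=mydb")  # your SQL Server connection string

def fetch_schema() -> str:
    # Step 1: generate the schema on the fly from INFORMATION_SCHEMA.
    rows = conn.execute(
        "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE "
        "FROM INFORMATION_SCHEMA.COLUMNS ORDER BY TABLE_NAME"
    ).fetchall()
    return "\n".join(f"{t}.{c}: {d}" for t, c, d in rows)

def ask_gpt(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer_question(question: str) -> str:
    schema = fetch_schema()
    # Step 2: ask GPT for SQL that answers the question, given the schema.
    sql = ask_gpt(f"Schema:\n{schema}\n\nWrite T-SQL answering: {question}\n"
                  "Return only the SQL.")
    # Step 3: run the SQL, then send the result set + question back to GPT.
    rows = conn.execute(sql).fetchall()
    return ask_gpt(f"Question: {question}\nQuery result: {rows}\n"
                   "Answer the question in plain English.")

# Step 4: show the answer to the user.
print(answer_question("What was total revenue last quarter?"))
```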

Hopefully this helps you get closer to your goal.
Thanks


I want to train a T5-small model in offline mode so it can perform a text-to-SQL task. I started with a single table of data and another table where 200+ records of queries and the corresponding text, along with context, are saved. But after training I could not get the desired results. Could someone please guide me?
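For reference, my training setup looks roughly like this (simplified; the column names, example records, and hyperparameters stand in for my actual values):

```python
# Simplified version of my setup: fine-tune T5-small for text-to-SQL.
# For offline mode, download t5-small beforehand and set HF_HUB_OFFLINE=1.
# Assumes the transformers and datasets packages.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# My real data has 200+ rows like these (question, table context, target SQL).
records = [
    {"question": "How many accounts are there?",
     "context": "CREATE TABLE accounts (id int, name text, industry text)",
     "sql": "SELECT COUNT(*) FROM accounts"},
    {"question": "List account names in the retail industry",
     "context": "CREATE TABLE accounts (id int, name text, industry text)",
     "sql": "SELECT name FROM accounts WHERE industry = 'retail'"},
]
dataset = Dataset.from_list(records)

def preprocess(batch):
    # T5 is text-to-text: prefix the question with the task and the context.
    inputs = [f"translate to SQL: {q} context: {c}"
              for q, c in zip(batch["question"], batch["context"])]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["sql"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5-sql",
                                  num_train_epochs=20,
                                  per_device_train_batch_size=8,
                                  learning_rate=3e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```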