Seeking Guidance on Building a ChatGPT-Style Data Analyst Tool with Database Integration

Hey everyone,

I’m currently diving into a project aimed at crafting a tool akin to a data analyst, inspired by ChatGPT’s capabilities. The twist is, instead of relying on uploading CSV files, this tool would directly connect to databases, allowing users to ask complex questions about:

  • Aggregate statistics (like total revenue per account, average deal size)
  • Time-based insights (such as the age of an account or the time since the last transaction)
  • Categorical data interpretations for non-numeric fields (industry type, account region, etc.)
  • Custom calculations or specific business logic for deeper analysis

I started experimenting with an approach where GPT generates PostgreSQL queries, executed through the pg npm module. Initially, it seemed promising, but I’ve hit a snag considering:

  • The challenge of handling databases with 40-50 tables, each containing upwards of 20 fields. Detailing the schema in the prompt quickly becomes impractical due to size constraints.
  • The token limit makes it difficult to employ embeddings effectively for this purpose.

I’m reaching out to see if anyone could offer guidance or point me towards open-source projects that tackle similar challenges. Any help or suggestions would be greatly appreciated!

Interacting with an SQL database using an LLM (.i.e Chat with your SQL Data) has become a common application and there are a few people doing it a few different way. Below are two examples that might be helpful as architectural guides on how to implement your system.

LangChain Cookbook - Retrieval with SQL

Vanna - RAG (Retrieval-Augmented Generation) framework for SQL generation and related functionality

Thank you for your earlier help; it was really useful. I have one more thing to ask. Have you used Vanna AI?

I tried it by training my database like the guide said, but the guide doesn’t really explain what to do when there are many tables and relationships in the database.

Can you help me with this? It works fine when I’m just dealing with one table, but what about when I need to make more complex queries that involve more than one table? Do you know any easy way to do this?

Glad to hear that the recommendations were useful. I’m not an expert in Vanna, but that project has a community forum just like this one that you can use to ask questions.