I have a large statistical database with different tables for different categories. Each table has slightly different columns, but some are common, such as the year, the month, the observation value, and how the observations are categorized.
I am trying to build a feature where the user enters a prompt in natural language, and I query the database and return the answer along with the relevant results.
TL;DR:
- I just want to convert the natural-language prompt into a filter (an object containing the filtered columns and values, e.g. {year: 2022, subject: whatever, …}), and then summarize the answer.
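As a sketch of what I mean, the filter object could be typed like this (the field names here are just examples from my schema, not fixed):

```python
from typing import TypedDict

class Filter(TypedDict, total=False):
    """Hypothetical shape of the structured filter extracted from a prompt."""
    year: int
    month: int    # 1-12, normalized from month names
    subject: str

# "education statistics for 2022" would become:
f: Filter = {"year": 2022, "subject": "education"}
```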
Here is my suggested workflow (and part of it is done as a PoC):
→ Question Parsing
- Identify Language
- Identify Table/View to target
- Retrieve AI instructions (this includes any special rules for an entity; I was thinking of storing special instructions per table, correct me if this is wrong)
- Transform values (for example, always convert month names to numbers [June → 6], etc.)
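For the value-transformation step, a minimal sketch of the month normalization I mean (pure Python, no model call needed for this part):

```python
import calendar

# Map "january" -> 1 ... "december" -> 12; index 0 of month_name is empty.
MONTHS = {name.lower(): i for i, name in enumerate(calendar.month_name) if name}

def normalize_month(value):
    """Convert a month name like 'June' to its number (6); pass ints through."""
    if isinstance(value, int):
        return value
    return MONTHS.get(str(value).strip().lower())
```

Deterministic rules like this one can run after the model's extraction, so the model is free to return "June" and the code still stores 6.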
→ Entity Recognition
- Find the entities in the prompt that will be used to filter the database (the year, the month, the subject, etc.).
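For this step I was considering function calling. A sketch of what the tool schema might look like (OpenAI function-calling style; the name `extract_filters` and the fields are hypothetical, taken from my common columns):

```python
# Tool schema handed to the chat completion call so the model returns
# the filter entities as structured JSON instead of free text.
extract_filters_tool = {
    "type": "function",
    "function": {
        "name": "extract_filters",
        "description": "Extract database filter values from the user's question.",
        "parameters": {
            "type": "object",
            "properties": {
                "year": {"type": "integer"},
                "month": {"type": "integer", "minimum": 1, "maximum": 12},
                "subject": {"type": "string"},
            },
            "required": [],
        },
    },
}
```

The per-table "AI instructions" I mentioned above could then be injected into the system prompt alongside this schema.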
→ Data Retrieval
- This step will use the entities from the previous step (I built a service connected to the SQL DB that selects rows based on the filtered columns).
- Identify any extra columns to always include in the response.
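My retrieval service does roughly this (simplified sketch; the table name, column whitelist, and `%s` placeholder style are assumptions, adjust for your driver). The whitelist matters because column names can't be parameterized, only values can:

```python
def build_query(table, filters, extra_columns=()):
    """Build a parameterized SELECT from a filter dict.

    Only whitelisted columns are used in WHERE clauses; values go in as
    query parameters, never interpolated into the SQL string.
    """
    allowed = {"year", "month", "subject"}  # per-table whitelist (assumed)
    cols = ", ".join(["year", "month", "value", *extra_columns])
    clauses, params = [], []
    for col, val in filters.items():
        if col not in allowed:
            continue  # never interpolate unknown column names
        clauses.append(f"{col} = %s")
        params.append(val)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    return f"SELECT {cols} FROM {table}{where}", params
```

The "extra columns to always include" then just get appended via `extra_columns`.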
→ Answer generation using another hit on the GPT API
- I feed it the retrieved data (which can sometimes be large)
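Because the result set can be large, I was thinking of pre-aggregating or truncating before the second API call, something like (the 50-row cutoff and the `value` key are placeholders for my schema):

```python
def rows_to_prompt(rows, max_rows=50):
    """Serialize retrieved rows for the answer-generation call.

    When the result set is large, send an aggregate plus a small sample
    instead of every row, to stay within the model's context window.
    """
    if len(rows) > max_rows:
        total = sum(r["value"] for r in rows)
        return f"{len(rows)} rows, total value {total}; sample: {rows[:5]}"
    return "\n".join(str(r) for r in rows)
```

Whether summing is the right aggregate obviously depends on the metric; that logic could live with the per-table instructions.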
I'd appreciate suggestions for a robust solution, ideally one general enough to be useful to the whole community.
Should I fine-tune, or can I make use of the Assistants API and function calling, or something else? Do I need to embed any data? Are there other techniques?