Creating a RAG solution from a SQL DB with Summarization

I have a large statistical DB (with different tables for different categories), each of them has slightly different columns but some are common such as the year, month, the observation value, and how they are categorized.

I am trying to make a feature where the user will enter a prompt (human language), and I will retrieve the answer + query the relevant results.

TL;DR:

  • I just want to convert human language to a filter (a object containing the filtered columns and values {year: 2022, subject: whatever, …}, and then summarize the answer.

Here is my suggested workflow (and part of it is done as a PoC):

→ Question Parsing

  • Identify Language
  • Identify Table/View to target
  • Retrieve AI instructions (This includes any special rules for an entity, I was thinking of storing special instructions for each table, correct me if this is wrong)
  • Transforming values (For example always transform months to numbers [June → 6], etc.)

→ Entity Recognition|

  • Find the entities from the prompt that will be used to filter the database (find the year, the month, the subject, etc).

→ Data Retrieval

  • This step will use the entities from the previous step (I made a service connected to the SQL DB which selects based on the filtered columns).
  • Identify any extra columns to always include in the response.

→ Answer generation using another hit on the GPT API

  • I feed it the retrieved data (which could be large sometimes)

I need your suggestion on a robust solution, and to make this solution general for the whole community.
Should I finetune, or can I make use of the assistant and function calling? or what? Do I need to embed any data? Are there any other techniques?

To approach different datasets, I would have multiple agents.

For instance:
Agent 1 accepts the question. It has a high level understanding of the different datasets and sends the question (possibly refined question) to a sub agent that it thinks has a dataset that will answer the user question.

This next agent’s purpose is to generate SQL to query the dataset it was assigned to. It has knowledge of all fields in the dataset as well as an overview of context around the data.

Thanks Kevin,
What do you mean by agent? normal API calls?

Currently I am doing this, but it doesn’t seem right.
Also sometimes the data I get back from the db is large to feed it back to the model again.

Please explain further, with thanks.