I have a large statistical database with different tables for different categories. Each table has slightly different columns, but some are common, such as the year, the month, the observation value, and how the observations are categorized.
I am trying to build a feature where the user enters a prompt in natural language, and I query the database and return the answer along with the relevant results.
TL;DR:
- I just want to convert the natural-language prompt into a filter (an object containing the filtered columns and values, e.g. {year: 2022, subject: whatever, …}), and then summarize the answer.
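As a sketch of what I mean, the filter object could be typed like this (the field names here are just examples from my schema, not fixed):

```python
from typing import TypedDict

class Filter(TypedDict, total=False):
    """Hypothetical shape of the structured filter extracted from a prompt."""
    year: int
    month: int    # 1-12, normalized from month names
    subject: str

# "education statistics for 2022" would become:
f: Filter = {"year": 2022, "subject": "education"}
```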
Here is my suggested workflow (and part of it is done as a PoC):
→ Question Parsing
- Identify Language
- Identify Table/View to target
- Retrieve AI instructions (this includes any special rules for an entity; I was thinking of storing special instructions per table, correct me if this is wrong)
- Transform values (for example, always convert month names to numbers [June → 6], etc.)
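For the value-transformation step, a minimal sketch of the month normalization I mean (pure Python, no model call needed for this part):

```python
import calendar

# Map "january" -> 1 ... "december" -> 12; index 0 of month_name is empty.
MONTHS = {name.lower(): i for i, name in enumerate(calendar.month_name) if name}

def normalize_month(value):
    """Convert a month name like 'June' to its number (6); pass ints through."""
    if isinstance(value, int):
        return value
    return MONTHS.get(str(value).strip().lower())
```

Deterministic rules like this one can run after the model's extraction, so the model is free to return "June" and the code still stores 6.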
→ Entity Recognition
- Find the entities in the prompt that will be used to filter the database (the year, the month, the subject, etc.).
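For this step I was considering function calling. A sketch of what the tool schema might look like (OpenAI function-calling style; the name `extract_filters` and the fields are hypothetical, taken from my common columns):

```python
# Tool schema handed to the chat completion call so the model returns
# the filter entities as structured JSON instead of free text.
extract_filters_tool = {
    "type": "function",
    "function": {
        "name": "extract_filters",
        "description": "Extract database filter values from the user's question.",
        "parameters": {
            "type": "object",
            "properties": {
                "year": {"type": "integer"},
                "month": {"type": "integer", "minimum": 1, "maximum": 12},
                "subject": {"type": "string"},
            },
            "required": [],
        },
    },
}
```

The per-table "AI instructions" I mentioned above could then be injected into the system prompt alongside this schema.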
→ Data Retrieval
- This step will use the entities from the previous step (I built a service connected to the SQL DB that selects rows based on the filtered columns).
- Identify any extra columns to always include in the response.
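My retrieval service does roughly this (simplified sketch; the table name, column whitelist, and `%s` placeholder style are assumptions, adjust for your driver). The whitelist matters because column names can't be parameterized, only values can:

```python
def build_query(table, filters, extra_columns=()):
    """Build a parameterized SELECT from a filter dict.

    Only whitelisted columns are used in WHERE clauses; values go in as
    query parameters, never interpolated into the SQL string.
    """
    allowed = {"year", "month", "subject"}  # per-table whitelist (assumed)
    cols = ", ".join(["year", "month", "value", *extra_columns])
    clauses, params = [], []
    for col, val in filters.items():
        if col not in allowed:
            continue  # never interpolate unknown column names
        clauses.append(f"{col} = %s")
        params.append(val)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    return f"SELECT {cols} FROM {table}{where}", params
```

The "extra columns to always include" then just get appended via `extra_columns`.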
→ Answer generation using another hit on the GPT API
- I feed it the retrieved data (which can sometimes be large)
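Because the result set can be large, I was thinking of pre-aggregating or truncating before the second API call, something like (the 50-row cutoff and the `value` key are placeholders for my schema):

```python
def rows_to_prompt(rows, max_rows=50):
    """Serialize retrieved rows for the answer-generation call.

    When the result set is large, send an aggregate plus a small sample
    instead of every row, to stay within the model's context window.
    """
    if len(rows) > max_rows:
        total = sum(r["value"] for r in rows)
        return f"{len(rows)} rows, total value {total}; sample: {rows[:5]}"
    return "\n".join(str(r) for r in rows)
```

Whether summing is the right aggregate obviously depends on the metric; that logic could live with the per-table instructions.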
I'd appreciate suggestions for a robust solution, ideally one general enough to be useful to the whole community.
Should I fine-tune, or can I make use of the Assistants API and function calling, or something else? Do I need to embed any data? Are there other techniques?