Hi everyone,
I’m new to training custom GPT models and would appreciate some guidance on the best approach for my project.
Problem: I’m working with e-commerce data and planning a pipeline that ingests data from a relational database into BigQuery. My intention is to denormalize all of the tables into a single wide table that consolidates everything. The pipeline will run on a daily schedule, fetching incremental updates and loading them into BigQuery; a rough sketch of what I have in mind is below.
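Roughly, I’m picturing the daily load as a BigQuery MERGE run from Python. Everything here (project, dataset, table and column names, and the `updated_at` change-detection column) is made up just to illustrate the idea:

```python
# Sketch of the daily incremental load. All names are placeholders; it
# assumes the denormalized table has exactly the columns produced by the
# SELECT below and that source rows carry an `updated_at` timestamp.
from google.cloud import bigquery

client = bigquery.Client(project="my_project")

merge_sql = """
MERGE `my_project.ecommerce.sales_denormalized` AS target
USING (
  -- Join the normalized source tables into one wide row per order line,
  -- keeping only rows that changed since the previous daily run.
  SELECT
    o.order_id, i.product_id, o.order_date, c.customer_name,
    p.product_name, i.quantity, i.unit_price,
    i.quantity * i.unit_price AS line_total, o.updated_at
  FROM `my_project.staging.orders` AS o
  JOIN `my_project.staging.order_items` AS i ON i.order_id = o.order_id
  JOIN `my_project.staging.products` AS p ON p.product_id = i.product_id
  JOIN `my_project.staging.customers` AS c ON c.customer_id = o.customer_id
  WHERE o.updated_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
) AS source
ON target.order_id = source.order_id AND target.product_id = source.product_id
WHEN MATCHED THEN UPDATE SET
  quantity = source.quantity,
  unit_price = source.unit_price,
  line_total = source.line_total,
  updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT ROW
"""

# A scheduler (Cloud Scheduler, Airflow, cron) would run this once a day.
client.query(merge_sql).result()
```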
Goal: I want to build an interface where business owners can ask questions in natural language, and a GPT model will use the BigQuery dataset to provide answers. Questions might include “What were the total sales of ‘hairclips’ yesterday?” or “Show me the trend in total sales value for the last 12 months.”
Proposed Approach: Currently, I’m considering training the GPT model to generate SQL queries from the questions it receives: given a question, the model would produce a SQL query tailored to the denormalized table in BigQuery; I would then execute that query and return the result. A sketch of the loop I have in mind follows.
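This is only how I imagine it, not something I’ve built: the model name, project/table names, and the hand-written schema string are all placeholders, and a real version would need to validate the generated SQL before running it.

```python
# Rough sketch of the question -> SQL -> BigQuery -> answer loop.
# Project/table/model names and the schema text are placeholders.
from google.cloud import bigquery
from openai import OpenAI

bq = bigquery.Client(project="my_project")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

TABLE = "my_project.ecommerce.sales_denormalized"
TABLE_SCHEMA = (
    "order_id STRING, product_id STRING, order_date DATE, customer_name STRING, "
    "product_name STRING, quantity INT64, unit_price NUMERIC, line_total NUMERIC"
)

def answer_question(question: str) -> str:
    # Give the model the table name and columns so it knows what to query;
    # it never sees the data itself, only the schema and the question.
    prompt = (
        f"You write BigQuery Standard SQL. The table `{TABLE}` has columns:\n"
        f"{TABLE_SCHEMA}\n"
        f"Return one SELECT statement, with no commentary, that answers: {question}"
    )
    response = llm.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )

    # Strip possible code fences; a real version should also validate the SQL.
    sql = response.choices[0].message.content.strip()
    sql = sql.removeprefix("```sql").removeprefix("```").removesuffix("```").strip()

    # BigQuery does the actual aggregation; only the small result comes back.
    rows = bq.query(sql).result()
    return "\n".join(str(dict(row.items())) for row in rows)

print(answer_question("What were the total sales of 'hairclips' yesterday?"))
```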
I’m unsure whether this is the most effective approach. Ideally, I’d like the GPT model to analyze the data directly and answer without generating SQL against a table it has no knowledge of. However, providing the entire dataset in the prompt isn’t feasible given token limits and the size of the data.
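So I assume I’d have to give the model the table’s schema rather than the data itself, since the schema stays tiny no matter how big the table gets. Something like this is what I mean (placeholder table name):

```python
# Fetch the denormalized table's schema from BigQuery and turn it into a
# short text block for the prompt. The table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client(project="my_project")
table = client.get_table("my_project.ecommerce.sales_denormalized")

schema_text = "\n".join(
    f"- {field.name} ({field.field_type}): {field.description or 'no description'}"
    for field in table.schema
)
print(schema_text)  # a few hundred tokens at most, regardless of row count
```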
Could someone please advise on the best approach for this kind of use case? Specifically, is a single denormalized table in BigQuery a good design, and what is the best way to train the GPT model for it?
Thank you!