Project ideas guidance - using GPT to answer questions based on private data

Hey all,

Not sure if this thread is in the correct category, but I’m looking for a bit of guidance nonetheless.

As a .NET engineer looking to learn a bit of Python, I’d like to begin work on a side project that incorporates OpenAI GPT.

As a keen runner, I plan to pull in data from the Strava API and store it in some sort of DB of my own. I would then like to use this data as a private source to answer questions via GPT such as:

  • What is my fastest 5km time?
  • What is my average distance per run over the last year?
  • How many runs have been over 10km in the last year?

It’s a simple idea, but my question is: how would you suggest I use the data to source the answers to my example questions? Would I connect GPT directly to my database, or would I read out of the DB every time a question is submitted, format the data, and pass that to OpenAI as a data source?

Thanks in advance

James

I’m a keen runner too, with Strava API access. Did you have any luck with this?

I would probably approach this by creating a prompt that turns user queries into Strava API requests and then uses the API response to generate natural language answers.

To achieve this, provide the API docs within your prompt, then add a few examples showing how a user query should be converted into a valid API call and how the API response should be converted into a natural language answer.

You can extract the API call from the model output and send it to the Strava API.

For response handling, pass the API response back to the model in a second prompt, which will then output the result as a natural language answer to the user query.
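A rough sketch of that loop, in case it helps. The endpoint summary and example are hand-written assumptions (paste the real Strava docs in practice), all the function names are made up for illustration, and the model reply is a hard-coded stand-in so it runs offline:

```python
import json
import re
from urllib.parse import urlencode

# Hand-written summary of one Strava endpoint (an assumption for this sketch);
# in practice you would paste the relevant part of the official API docs.
API_DOCS = """\
GET /athlete/activities
  after    (int): epoch seconds, only return activities after this time
  before   (int): epoch seconds, only return activities before this time
  per_page (int): number of items per page
"""

# One worked example so the model copies the output format.
FEW_SHOT = (
    "User: Which runs did I do since 1 Jan 2024?\n"
    'API_CALL: {"path": "/athlete/activities", '
    '"params": {"after": 1704067200, "per_page": 100}}\n'
)


def build_request_prompt(question: str) -> str:
    """Prompt asking the model to turn a question into one API call."""
    return (
        "Convert the user's question into a single Strava API call, "
        "written as JSON on one line.\n\n"
        f"API docs:\n{API_DOCS}\nExample:\n{FEW_SHOT}\n"
        f"User: {question}\nAPI_CALL:"
    )


def extract_api_call(model_output: str) -> dict:
    """Pull the JSON object out of the model's reply and sanity-check it."""
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    call = json.loads(match.group(0))
    if call.get("path") != "/athlete/activities":
        raise ValueError(f"unexpected path: {call.get('path')!r}")
    return call


def to_url(call: dict, base: str = "https://www.strava.com/api/v3") -> str:
    """Build the request URL; a real request also needs an OAuth bearer token."""
    return f"{base}{call['path']}?{urlencode(call.get('params', {}))}"


def build_answer_prompt(question: str, api_response: str) -> str:
    """Second prompt: hand the API response back and ask for a plain answer."""
    return (
        f"Question: {question}\n"
        f"Strava API response (JSON):\n{api_response}\n"
        "Answer the question using only this data."
    )


# Stand-in for a real model reply, so the sketch runs offline.
fake_reply = (
    'API_CALL: {"path": "/athlete/activities", '
    '"params": {"after": 1704067200, "per_page": 50}}'
)
call = extract_api_call(fake_reply)
print(to_url(call))
```

Checking the extracted path before sending anything is worth keeping even in a side project, since the model can occasionally produce a call you did not intend.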

Okay great, thank you.

I’m thinking I may extract my Strava data in its entirety to a document DB so that I can display each activity in some sort of list. Then a prompt input below that would query the data stored in the DB.

I’m just not sure if I should pad the prompt with the full data set every time I ask the AI a question, or whether I can set the DB as a data source so there’s no need to pad the prompt.

Yes, you can look into functions/tools and see if the model can query your DB for the requested data.
Obviously you wouldn’t need an LLM for this task; it’s more of a data presenter, maybe?
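Roughly what the functions/tools route could look like. The tool definition follows the JSON-schema shape used for function calling in the OpenAI Chat Completions API; the table layout, the `count_runs_over` function and the sample rows are all invented for this sketch, and no API call is actually made:

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory DB with a hypothetical `runs` table and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runs (start_date TEXT, distance_m REAL, moving_time_s INTEGER)"
    )
    conn.executemany(
        "INSERT INTO runs VALUES (?, ?, ?)",
        [
            ("2024-03-02", 5000.0, 1500),
            ("2024-04-10", 10500.0, 3300),
            ("2024-05-18", 5000.0, 1440),
            ("2024-06-01", 12000.0, 3900),
        ],
    )
    conn.commit()
    return conn


# Tool description you would send along with the chat request.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "count_runs_over",
            "description": "Count runs longer than a distance (metres) since an ISO date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "min_distance_m": {"type": "number"},
                    "since": {"type": "string", "description": "e.g. 2024-01-01"},
                },
                "required": ["min_distance_m", "since"],
            },
        },
    }
]


def count_runs_over(conn: sqlite3.Connection, min_distance_m: float, since: str) -> int:
    """The actual DB query; the model never sees the database itself."""
    row = conn.execute(
        "SELECT COUNT(*) FROM runs WHERE distance_m > ? AND start_date >= ?",
        (min_distance_m, since),
    ).fetchone()
    return row[0]


def dispatch(conn: sqlite3.Connection, name: str, arguments_json: str) -> str:
    """Run the tool the model asked for and return a JSON string result."""
    args = json.loads(arguments_json)
    if name == "count_runs_over":
        return json.dumps({"count": count_runs_over(conn, **args)})
    raise ValueError(f"unknown tool: {name}")


# Stand-in for the tool call a model might return for
# "How many runs have been over 10km since the start of 2024?"
print(dispatch(make_db(), "count_runs_over",
               '{"min_distance_m": 10000, "since": "2024-01-01"}'))
```

The result string would then go back to the model as the tool output, and the model writes the sentence the user sees.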

It would be inefficient and slow/expensive to include the full DB in the prompt.

Much better to interact with the DB using API/SQL/functions/tools/actions - whatever is easiest to extract the info you need.

The LLM can format the correct requests each time if you provide it with an example and instructions. Then you will only need to extract the data you actually need each time.
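To put a rough number on the difference, here is a small comparison of prompt sizes. The activities are synthetic, and the 4-characters-per-token figure is only a common rule of thumb for English text, not an exact tokenizer count:

```python
import json


def rough_token_estimate(text: str) -> int:
    """Very rough estimate: about 4 characters per token (rule of thumb only)."""
    return len(text) // 4


def synthetic_activities(n: int) -> list[dict]:
    """Made-up activities, just to have something of realistic size."""
    return [
        {
            "id": i,
            "name": f"Morning Run {i}",
            "distance_m": 5000 + 10 * i,
            "moving_time_s": 1500 + 3 * i,
            "start_date": f"2024-01-{1 + i % 28:02d}",
        }
        for i in range(n)
    ]


activities = synthetic_activities(1000)

# Option A: paste every activity into the prompt on every question.
full_dump = json.dumps(activities)

# Option B: query first, then send only the value the question needs.
extracted = json.dumps(
    {"fastest_moving_time_s": min(a["moving_time_s"] for a in activities)}
)

print("full dump :", rough_token_estimate(full_dump), "tokens (approx.)")
print("extracted :", rough_token_estimate(extracted), "tokens (approx.)")
```

The exact figures depend on your data and the tokenizer, but the gap grows with every activity you log.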

I think I’m missing some understanding here then. As an example, for me to be able to ask in an AI dialog, ‘What is my fastest 5k run time?’ or ‘What is my average run distance over the past year?’, would I need to provide the full data source each time?

I’m not sure how I’d use these questions to generate meaningful SQL, for example, which in turn would return data to use in a prompt.

You can instruct the LLM to write the SQL or API call to query a data source.

The returned data can then be used by the LLM to create the output.

You can also do all of this with a Custom GPT.

Hey Jeff,

I understand that I could generate SQL using the LLM. This seems like a separate problem to solve though instead of just feeding the LLM data to create an output.

I think what you’re saying is, just so I’m clear, with the following question as an example:

“What is my longest run in the past year?”

Step 1: Use the LLM to generate SQL or a document DB query, such as "SELECT * FROM runs WHERE Date > DateTime.Now.AddYears(-1)".

Step 2: Use the LLM to query the returned data with the question asked and create an output…

For now it seems Step 1 would be an enhancement but for a side project maybe not needed
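For what it’s worth, the two steps above could be sketched like this. The `runs` schema, the sample rows and the SQL string standing in for the model’s Step 1 output are all assumptions made for illustration; the only real library used is Python’s built-in sqlite3:

```python
import sqlite3

SCHEMA = "CREATE TABLE runs (start_date TEXT, distance_m REAL, moving_time_s INTEGER)"


def make_db() -> sqlite3.Connection:
    """In-memory DB with made-up runs dated relative to today."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    for offset, distance, seconds in [
        ("-30 days", 21097.5, 6900),
        ("-200 days", 10000.0, 3100),
        ("-500 days", 42195.0, 15000),  # older than a year, should be ignored
    ]:
        conn.execute(
            "INSERT INTO runs VALUES (date('now', ?), ?, ?)",
            (offset, distance, seconds),
        )
    conn.commit()
    return conn


def sql_prompt(question: str) -> str:
    """Step 1 prompt: give the model the schema and ask for one SELECT."""
    return (
        f"SQLite schema:\n{SCHEMA}\n\n"
        "Write one SELECT statement that answers the question. "
        f"Reply with SQL only.\nQuestion: {question}"
    )


def run_select(conn: sqlite3.Connection, sql: str) -> list[dict]:
    """Run model-written SQL, refusing anything but a single SELECT."""
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith("select") or ";" in stripped:
        raise ValueError("only a single SELECT statement is allowed")
    cur = conn.execute(stripped)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def answer_prompt(question: str, rows: list[dict]) -> str:
    """Step 2 prompt: only the returned rows go to the model, not the whole DB."""
    return f"Question: {question}\nQuery result: {rows}\nAnswer in one sentence."


question = "What is my longest run in the past year?"

# Stand-in for what the model might return for sql_prompt(question).
MODEL_SQL = (
    "SELECT start_date, distance_m FROM runs "
    "WHERE start_date >= date('now', '-1 year') "
    "ORDER BY distance_m DESC LIMIT 1;"
)

rows = run_select(make_db(), MODEL_SQL)
print(answer_prompt(question, rows))
```

Seen this way, Step 1 is less of an enhancement and more the thing that keeps the Step 2 prompt down to a single row.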

If it’s just a side project, you can simply build a Custom GPT and create an action to query your data, and the GPT will handle it all.

How you get AI to access your data to answer this question really depends on the scale of the project.

As a simple prompt, yes, you’d have to include the full data source every time. Not delightful.

@jeffinbournemouth is correct: if you just want to explore personal data and analysis over the last year, use a Custom GPT. Here’s a case study.

Download your data as spreadsheets, name them, and reference them in the Custom GPT’s Instructions and Knowledge Base with Code Interpreter enabled. The GPT can then run analysis over your full data set without you having to paste it into each prompt.
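To give a feel for the kind of analysis that could be run over an uploaded spreadsheet, here is a small stand-alone sketch. The column names and rows are invented (a real export will have its own headers), and "fastest 5k" here only considers runs of roughly 5 km total, not the best 5 km split inside a longer run:

```python
import csv
import io
from statistics import mean

# Made-up export; a real spreadsheet will have different column names.
SAMPLE_CSV = """\
date,distance_km,moving_time_s
2024-03-02,5.0,1500
2024-04-10,10.5,3300
2024-05-18,5.0,1440
2024-06-01,12.0,3900
"""


def load_runs(f) -> list[dict]:
    """Parse CSV rows into typed dicts."""
    return [
        {
            "date": row["date"],
            "distance_km": float(row["distance_km"]),
            "moving_time_s": int(row["moving_time_s"]),
        }
        for row in csv.DictReader(f)
    ]


def summarise(runs: list[dict]) -> dict:
    """Answer the three example questions from the first post."""
    five_k_times = [
        r["moving_time_s"] for r in runs if abs(r["distance_km"] - 5.0) < 0.1
    ]
    return {
        "fastest_5k_s": min(five_k_times) if five_k_times else None,
        "average_distance_km": mean(r["distance_km"] for r in runs),
        "runs_over_10km": sum(1 for r in runs if r["distance_km"] > 10.0),
    }


summary = summarise(load_runs(io.StringIO(SAMPLE_CSV)))
print(summary)
```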

If you have something more robust in mind than just a personal running coach, you’ll want to look into connecting the GPT / Assistant to your data via API. Then solutions scale up from there depending on just how much information you need the model to make decisions with.

See the create_tool_calling_executor in this project (of mine)

This is the best “tech stack” if you ask me, for doing what you want: Streamlit + LangChain + LangChain Agents. You would just define some “tools” that “wrap” the access to your data in ways that would allow an intelligent being to answer questions through the API but without having “direct” access to the data. Of course, one “tool” could be “run SQL command”, where you let the AI know that the tool takes an SQL string and returns the data.
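A framework-free sketch of the “tools that wrap the data” idea, for anyone who wants to see the shape before pulling in a library. This is not the code from the linked project and does not use LangChain; the `Tool` class, tool names and table are assumptions for illustration. The `run_sql` tool relies on SQLite’s `PRAGMA query_only` so that writes are rejected:

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    """A named function plus the description the model gets to read."""
    name: str
    description: str
    func: Callable[[str], str]


def make_tools(conn: sqlite3.Connection) -> dict[str, Tool]:
    """Wrap DB access in tools; the agent only ever sees strings in and out."""
    # Make the connection reject writes, so run_sql can only read.
    conn.execute("PRAGMA query_only = ON")

    def run_sql(query: str) -> str:
        try:
            rows = conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            # Return the error as text so the model can correct its query.
            return f"error: {exc}"
        return repr(rows)

    def describe_schema(_: str) -> str:
        rows = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        return "\n".join(r[0] for r in rows)

    tools = [
        Tool("run_sql", "Takes an SQL string and returns the rows.", run_sql),
        Tool("describe_schema", "Returns the CREATE TABLE statements.", describe_schema),
    ]
    return {t.name: t for t in tools}


# Hypothetical table and rows for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (start_date TEXT, distance_m REAL)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [("2024-03-02", 5000.0), ("2024-06-01", 12000.0)],
)
conn.commit()

tools = make_tools(conn)
print(tools["describe_schema"].func(""))
print(tools["run_sql"].func("SELECT COUNT(*) FROM runs"))
```

An agent framework adds the loop that lets the model pick a tool by name and read the result, but the wrapping itself is about this small.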

EDIT: I need to create a generic template version of that project that has all the plumbing but nothing too specific, so people can use it as a starter project for doing LangChain stuff with LangChain Tools.