Seeking Advice: Optimizing Data Integration for Custom GPT

I’m delving into an interesting challenge and could use your insights. Here’s a brief rundown of what we’re working on and the hurdles we’re facing:

  1. Use Case Overview:
  • We’re pulling data from an API endpoint using ChatGPT custom actions, which returns the information as a JSON blob.
  • The data retrieval is paginated, and when we prompt our custom GPT to make the multiple API calls required, the volume of data becomes unmanageable: the GPT either times out or takes a very long time to make all the calls and process the results.
  2. Initial Solution:
  • As a workaround, we’ve exported the data (from the API endpoint) to CSV and added it directly to the custom GPT. This approach has proven effective, allowing the GPT to process and utilize the data efficiently without the aforementioned issues.
  3. Ownership Limitation:
  • The limitation we’ve hit is that custom GPTs are single-owner systems, preventing multi-user management of the CSV data.
  4. Proposed Alternative:
  • We’re considering setting up a SharePoint document library to facilitate collaborative addition and updating of CSV files. The idea is to use a service like Zapier to bridge SharePoint and the custom GPT, enabling the GPT to access the latest CSV data. Something like this:

Seeking Your Feedback:

I’m at a crossroads and would love to get your thoughts on whether this SharePoint and Zapier setup is a viable solution. Additionally, I’m curious about RAG (Retrieval-Augmented Generation).

Is there a simpler RAG setup that I might be overlooking, which could streamline this process even further? Any insights on managing data integration for custom GPTs more efficiently or experiences with similar challenges would be greatly appreciated.

Thank you in advance for your help and suggestions!

I’d say you’re off to a good start!

Instead of thinking about Zapier as the bridge, why not think of your project itself as the bridge? So long as you have data that you can easily send to the custom GPT, you can use your project as the API endpoint. Your diagram doesn’t show where your actual project sits in the stack: it could go in between the custom GPT and Zapier, or replace Zapier entirely. Which database you choose is up to you and your needs.
Graph databases like Neo4j might be a good choice for leveraging the relationships between the doc library and the people those documents belong to.

I second that emotion. I’m doing something similar right now. All my data is in a variety of formats (PDFs, CSVs, TXT files, etc.) as file attachments in a Drupal CMS. I use Drupal’s REST API as my endpoint for API calls from my GPT.

Another solution: Google Sheets also has an API. You could upload the CSVs to Google Sheets and use that API.
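
For example, reading a sheet back out is one request against the Sheets API. This is just a sketch: the spreadsheet ID, range, and API key are placeholders, and a plain API key only works if the sheet is link-readable; otherwise you’d use OAuth.

```python
import requests

# Placeholders - substitute your own spreadsheet ID, tab/range, and API key.
SPREADSHEET_ID = "your-spreadsheet-id"
CELL_RANGE = "Invoices!A1:F"
API_KEY = "your-google-api-key"

url = f"https://sheets.googleapis.com/v4/spreadsheets/{SPREADSHEET_ID}/values/{CELL_RANGE}"
resp = requests.get(url, params={"key": API_KEY}, timeout=30)
resp.raise_for_status()

rows = resp.json().get("values", [])  # list of rows; first row is the header
print(rows[:5])
```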

If the data is sitting in a MySQL database, you could directly connect to it.
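
For instance, with the mysql-connector-python client (host, credentials, table, and column names below are placeholder assumptions, not your actual schema), it’s only a handful of lines:

```python
import mysql.connector  # pip install mysql-connector-python

# Connection details and the "tag" column are placeholders for your own schema.
conn = mysql.connector.connect(
    host="db.example.com",
    user="gpt_reader",
    password="********",
    database="invoices_db",
)
cur = conn.cursor(dictionary=True)
cur.execute("SELECT * FROM invoices WHERE tag = %s LIMIT 50", ("search_tag",))
rows = cur.fetchall()  # list of dicts, easy to serialize as JSON for the GPT
conn.close()
```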

Our main issue here would be security. If there is a way to securely access the data where it is now, that is, of course, the best option.

Thank you for pointing that out! You’ve highlighted a crucial gap in my explanation – at present, our “project” primarily consists of the custom GPT, and we’re in the early stages of figuring out how best to structure and host the supporting architecture.

The idea of incorporating a graph database is intriguing and might be a direction worth exploring. I’m open to any suggestions or insights on setting up a simple, effective stack for this purpose.

Some team members have proposed using a data warehouse, which, while comprehensive, seems more complex and resource-intensive than what we’re ready to commit to. It would require establishing a full database and developing API endpoints for data interaction, which my team considers a significant undertaking.

If you, or anyone else here, could share resources or recommend services that simplify the setup of such systems – making them more accessible for teams with limited experience in this area – I’d be extremely grateful. Any advice on starting points, particularly those that are beginner-friendly, would be immensely valuable to us.

For the record, I do have a different project that uses Python on an Azure Function app with the OpenAI Assistants API. But again, I haven’t figured out a good way to host and reference data to supplement my Assistant in that project either.


Very interesting. :thinking:

I’m new to Drupal, but it sounds like it offers robust API endpoints that could potentially simplify the way we pass data to our custom GPT. Could you share a bit more about how you’re using Drupal in this context? Specifically:

  1. Hosting: Do you host your Drupal setup locally, or do you use a cloud service? I’m exploring options that would allow for either direct data storage or a way to efficiently manage data queries and pagination with an external service, minimizing the load on our custom GPT.
  2. Handling Data and Pagination: How does Drupal’s REST API manage the pagination and data querying issues? I’m particularly interested in solutions that could offload these tasks from the GPT, allowing it to focus on processing the data rather than managing it.

Thank you for the suggestion! Unfortunately, our situation is a bit challenging due to the limitations of the third-party service we’re using. Here’s a bit more detail on what we’re dealing with:

Our third-party service provides an API for invoice retrieval but lacks the capability for more granular searches. Ideally, I’d like to query invoices based on specific tags like this:

GET https://someservice.com/invoices?tag=search_tag

However, the API only allows for a broad retrieval of all invoices, without the option to filter by tags or other criteria directly in the query:

GET https://someservice.com/invoices

Given this, we’re required to download the entire dataset of over 5,000 invoices and then manually filter them to find the ones we’re interested in. This process is not only inefficient but also puts a significant strain on our custom GPT, which is not ideal for handling such a large volume of data directly.

I’m looking for a workaround that might help us manage this data more effectively, reducing the load on our GPT. Whether it’s through direct database connection, intermediate processing, or any other strategy, I’d really appreciate any insights or recommendations you might have.


TL;DR:

Thanks, @Macha and @SomebodySysop, for your insights. Exploring simpler solutions for our GPT’s architecture, like graph databases or Drupal’s REST API, sounds promising. We’re challenged by a third-party service’s limited API, which complicates data management for over 5,000 invoices. Looking for efficient ways to host, manage, and filter data to ease the load on our GPT. Open to any suggestions or tools that could help streamline this process.

I use AWS EC2 cloud services. Let me be clear: Drupal (or a CMS like it – I hear WordPress has similar features) could be used for managing your data. Your embeddings are a different story. My setup works for me because I use a Drupal module called Search API Solr to convert my uploaded content to text, then a module I developed called SolrAI to mirror that content to my vector store. And, through the same module, I use OpenAI’s (and Google’s, AWS’s, and Mistral’s) LLMs to handle incoming API requests.

So, in short, you still need a vector store and you still have to use either the Assistants API or the Completions API to develop your own RAG mechanism, but you’ve got an easy, open-source method to do it that runs on your servers under your control.
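
If it helps to see it concretely, the bones of a do-it-yourself RAG step are small. Here’s a rough sketch using the OpenAI Python SDK with an in-memory store; the model names are just current examples, and in practice the chunks and vectors would live in Solr, pgvector, or whatever vector store you settle on:

```python
import numpy as np
from openai import OpenAI  # pip install openai numpy

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# In a real setup these chunks would come from your CMS / vector store.
chunks = [
    "Invoice 1001: ACME Corp, $5,200, tag=hardware",
    "Invoice 1002: Globex, $830, tag=software",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(chunks)

def answer(question, k=3):
    q = embed([question])[0]
    # Cosine similarity, then take the top-k chunks as context.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(chunks[i] for i in sims.argsort()[::-1][:k])
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return reply.choices[0].message.content
```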

I would look further into this if I were you. I am executing several API calls from the GPT to my server. They typically look like this:

/api/pricing-calculator/{creditAmt}/{docLimit}

For example:

```json
{
  "domain": "bible.booksai.org",
  "method": "get",
  "path": "/api/pricing-calculator/{creditAmt}/{docLimit}",
  "operation": "getPricingCalculator",
  "operation_hash": "33a9284688917f33dfc399d86e555f6a586b5516",
  "is_consequential": false,
  "params": {
    "creditAmt": "1",
    "docLimit": "5"
  }
}
```

My point being that you can definitely send multiple parameters via the GPT API.

OK, I’m just sort of spitballing here with a vague notion of what you’re trying to accomplish. But another advantage of hosting my own infrastructure is that I can use it not just to receive incoming calls, but to make outgoing calls.

Easy solution. Your incoming call from the GPT is this:

GET https://someservice.com/invoices

You can, of course, modify for the elements you need:

GET https://someservice.com/invoice_no/invoice_date/invoice_custid/invoice_lineitems/etc

Your API receives this request. Instead of downloading 5000 invoices, why not simply recode it and send the request to someotherservice like this?

GET https://someotherservice.com/invoices?invoice_no=invoice_no&invoice_date=invoice_date&invoice_custid=invoice_custid

You retrieve that info from someotherservice, do your magic on it, then send results back to GPT.

Problem solved.
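
In code, that middle layer can be tiny. Here’s a rough Flask sketch of the idea: pull the bulk data once, filter it, and hand the GPT only the slice it asked for. The upstream URL, the tags field, and the response cap are all placeholder assumptions.

```python
from flask import Flask, jsonify, request  # pip install flask requests
import requests

app = Flask(__name__)
UPSTREAM = "https://someservice.com/invoices"  # placeholder third-party API
_cache = {"invoices": None}

def all_invoices():
    # One bulk pull from the third-party API, cached in memory for reuse.
    if _cache["invoices"] is None:
        resp = requests.get(UPSTREAM, timeout=60)
        resp.raise_for_status()
        _cache["invoices"] = resp.json()
    return _cache["invoices"]

@app.route("/invoices")
def invoices():
    # The custom GPT calls this endpoint with ?tag=...; the filtering happens here.
    tag = request.args.get("tag")
    items = all_invoices()
    if tag:
        items = [i for i in items if tag in i.get("tags", [])]
    return jsonify(items[:100])  # cap the payload sent back to the GPT
```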

There is an open-source Drupal OpenAI project available now which might help get you jump-started in that CMS. You’ll have to look into it. I’m just suggesting that your issues are pretty solvable.

Yep, everything SomebodySysop said is pretty much what I would’ve said.

The other benefit is that it sounds like you guys have an actual team to do this.

It’s like you’re a hair’s breadth from discovering how much you already have lol.

Honestly, the hardest part for people to wrap their heads around is the first pass (OAI API to other stuff).

This. This statement right here lol. Development and some coding is inescapable here, yes, but you already have a lot of the “hard” parts done.

I’ll reiterate this:

It sounds scary and complicated, but in actuality, you have the “scary” parts mostly figured out already.

If I were your boss (or yourself, if you had hiring powers), a DevOps engineer could be the glue y’all seem to be missing. Not at all saying you couldn’t handle anything yourselves, but this sounds an awful lot like “We have cool stuff and ways we need to manage data exchange and development, but no one is used to doing this kind of work exclusively,” which, in decent-to-large-sized companies, is where the DevOps guy usually comes in to manage the pipelines so everyone can focus on what they do best.

Agreed, and I have that part figured out: I’m able to send multiple parameters using a custom GPT/Assistant. What I was getting at here is that the service itself doesn’t have the ability to accept any sort of filtering parameters, which is what has sent me down this path of trying to set up some sort of RAG mechanism as a workaround.

I’m actually doing something like this using an Azure Function app. It doesn’t connect to any data stores, but it does something similar. Maybe I can leverage what I’ve done there, with the addition of some type of data store, to accomplish this.
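
Roughly, I’m picturing the function looking something like this (Azure Functions v1 Python programming model; the in-memory list is just a stand-in for whatever data store we end up putting behind it):

```python
import json
import azure.functions as func

# Stand-in data; in practice this would come from a real store (SQL, Table Storage, etc.).
INVOICES = [
    {"invoice_no": "1001", "custid": "ACME", "tags": ["hardware"]},
    {"invoice_no": "1002", "custid": "Globex", "tags": ["software"]},
]

def main(req: func.HttpRequest) -> func.HttpResponse:
    # The custom GPT action calls this with an optional ?tag=... query parameter.
    tag = req.params.get("tag")
    items = [i for i in INVOICES if not tag or tag in i["tags"]]
    return func.HttpResponse(json.dumps(items), mimetype="application/json")
```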

Yeah, agreed. I feel like I’m really REALLY close to creating something that can actually do some cool stuff; I’m just missing the RAG mechanism piece.

1000% agree with you. I’ve been the primary engineer spearheading our development here and communicating to the team how everything OAI works. But keeping things realistic, the odds of us getting headcount for a DevOps engineer are… not great. That said, I don’t really have another option but to try and figure something out myself.

Putting it all together:

Sounds like either an EC2 machine or an Azure Function, together with some type of database, might just be the answer here. Then I’ll need to learn how to set up an API call into a service hosted in one of those environments. I’ll also need some way to store the data so my service has a place to work with it.

So, I need to develop my own Python service; plan, manage, and deploy the resource it’s running on; and manage the database it’s using to house the information. ezpz :melting_face:

Sarcasm aside, thank you both @Macha and @SomebodySysop. I feel like I have a bit more direction now.

I know this is the easy way to get going quickly, but if you’ve got your own development support, you may want to think about whether this is the best strategy long term. REST APIs are the future of interconnectedness, not Python. IMHO.

Whatcha mean? My game plan would be to use Python to interact with the REST API endpoints of my services.
