I’m learning to use OpenAI APIs to build an app and would like advice on what architecture I should use for my use case:
The user can provide tabular / structured data (rows & columns format)
Using the OpenAI API —> interpret insights from the data
I can already do this in https://chat.openai.com/ by copy/pasting data from a spreadsheet into the chat window and giving the necessary prompt.
But for large files - tens of thousands of rows, each file several GBs / TBs in size - it says the input text is too long.
How can I address this scenario using the OpenAI API?
Another thing - how can I ensure the system remembers the uploaded data so that the user can really “converse” with the data through successive prompts? For example, each successive response should build on previous prompts: if I ask for “sales for 2023” and the next prompt is “which product had highest sales” —> it should know I’m asking for the “product with the highest sales in 2023”.
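To show what I mean by “remembering”, here is a minimal sketch of the conversational loop I imagine, assuming the openai Python package (>= 1.0) and an API key in the environment; the model name and prompts are just placeholders:

```python
# Minimal sketch, assuming openai >= 1.0 and OPENAI_API_KEY set. The chat
# completion endpoint is stateless, so "memory" comes from resending the
# accumulated message history on every turn.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system",
            "content": "You answer questions about the user's sales data."}]

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # placeholder model name
        messages=history,
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})  # remember this turn
    return answer

ask("What were the sales for 2023?")
ask("Which product had the highest sales?")  # resolved against the 2023 context
```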
Could you please elaborate (or share links that can help guide me)?
Basically, how do I handle large input that exceeds the model’s maximum input token size?
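For reference, this is how I currently picture the problem: the data has to be split into chunks that each fit under the context limit. A sketch assuming the tiktoken package; the file name and the per-chunk budget are placeholders:

```python
# Sketch, assuming the tiktoken package. Split a CSV export into chunks that
# each stay under an illustrative per-chunk token budget; the file name and
# budget are placeholders, not real model limits.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS_PER_CHUNK = 100_000

chunks, current, current_tokens = [], [], 0
with open("purchases.csv") as f:              # hypothetical export of the table
    for line in f:
        n = len(enc.encode(line))
        if current and current_tokens + n > MAX_TOKENS_PER_CHUNK:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
if current:
    chunks.append("".join(current))
# Each chunk could be sent in its own API call, but for GB/TB-scale files the
# number of calls (and the cost) grows quickly.
```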
OK, I have done more research on it and now understand it better.
I have some follow-up questions:
From my understanding, embeddings are mainly for text data… can anyone give me the intuition behind how they can be useful for drawing inferences / insights from tabular (rows & columns) data? (It’s basically time-series data of user purchase history on an e-commerce store… no text/reviews etc.)
When making embeddings:
a) should I create an embedding for each row separately (and store it)?
b) or should I concatenate rows (up to the max token limit) —> create an embedding per chunk —> finally concatenate all the embeddings to get an embedding for the entire dataset?
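Here is roughly what I imagine option (a) would look like, if it helps frame the question. A sketch assuming the openai Python package (>= 1.0); the CSV path, column names, and model are placeholders, and a real dataset would need to be sent in batches rather than one call:

```python
# Sketch of option (a), assuming openai >= 1.0. Each row is flattened into a
# short text string and embedded separately; the vectors are kept next to the
# raw rows so individual records can later be retrieved by similarity.
# File name, column names, and model are placeholders.
import csv
from openai import OpenAI

client = OpenAI()

rows, texts = [], []
with open("purchases.csv", newline="") as f:      # hypothetical export
    for row in csv.DictReader(f):
        rows.append(row)
        # The structured row is turned into natural language before embedding.
        texts.append(f"user {row['user_id']} bought {row['product']} "
                     f"for {row['amount']} on {row['date']}")

# For a large file this would be done in batches, not a single call.
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [item.embedding for item in response.data]

# Store (row, vector) pairs in a vector store of your choice; a list suffices here.
index = list(zip(rows, vectors))
```

With per-row vectors, a user’s question can be embedded the same way and compared by cosine similarity to pull back the most relevant rows.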
Indeed, embeddings are mainly built for text, but they can be used to identify the specific records in your data that are near the topic of the query.
Imagine you use embeddings to identify the type of data you need to answer the question, which then triggers an aggregation using everyday software engineering. Then, armed with the aggregated data, you use a chat completion to wrap it into a narrative that examines and interprets it.
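In very rough Python, assuming openai >= 1.0, numpy, and a per-row index like the one sketched above (all names, models, and cutoffs are illustrative, not a definitive implementation):

```python
# Rough sketch of that flow: embeddings pick the relevant slice, ordinary code
# aggregates it, and a chat completion narrates the small aggregated result.
# "index" is a list of (row_dict, vector) pairs built ahead of time.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=[text])
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question: str, index) -> str:
    # 1. Retrieval: use embeddings only to find the slice of data that matters.
    q = embed(question)
    scored = sorted(index, key=lambda rv: cosine(q, np.array(rv[1])), reverse=True)
    relevant = [row for row, _ in scored[:500]]       # arbitrary top-k

    # 2. Aggregation: plain software engineering, no LLM involved.
    totals: dict[str, float] = {}
    for row in relevant:
        totals[row["product"]] = totals.get(row["product"], 0.0) + float(row["amount"])
    summary = "\n".join(f"{p}: {t:.2f}" for p, t in sorted(totals.items()))

    # 3. Narrative: the aggregated table is small enough to fit in one prompt.
    chat = client.chat.completions.create(
        model="gpt-4o-mini",                          # placeholder model name
        messages=[
            {"role": "system", "content": "Interpret the aggregated sales figures."},
            {"role": "user", "content": f"Question: {question}\n\nAggregates:\n{summary}"},
        ],
    )
    return chat.choices[0].message.content
```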
Today, there are ceilings on the amount of data you can throw at LLMs. More important, my own skills with large data sets are limited.
And there are practical limitations at a financial level. You could recursively pummel the API with paginated data, but the cost would be prohibitive. I think the only rational approach is to aggregate first.
Maybe there are some experts who know the secret sauce for your time-series use case.