App architecture --> how to send a large dataset for analysis (exceeding token limit)

hello,

I’m learning to use OpenAI APIs to build an app and would like advice on what architecture I should use for my use case:

  • User can provide tabular / structured data (rows & columns format)
  • Using the OpenAI API —> interpret insights from the data

I can already do this in https://chat.openai.com/ by copy/pasting data from a spreadsheet into the chat window and giving the necessary prompt.
But for large files - tens of thousands of rows, with files several GB / TB in size - it says the input text is too long.

  1. How can I address this scenario using the OpenAI API?

  2. Another thing - how can I ensure the system remembers the uploaded data so that the user can really “converse” with the data through successive prompts? For example, each successive response should build on the previous prompts (if I first ask for “sales for 2023” and the next prompt is “which product had highest sales” —> it should know I’m asking for the “product with highest sales in 2023”).

1 Like

Hi @test.user.00

In your case you can use gpt-3.5-turbo for chat completions.
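
A minimal sketch of what that could look like, assuming the official Python client and a placeholder data snippet that you've already cut down to fit the context window:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder: a small slice of the tabular data, already reduced to fit the context window.
data_snippet = (
    "date,product,sales\n"
    "2023-01-05,Widget A,1200\n"
    "2023-01-06,Widget B,950\n"
)

# Keep the whole conversation in one list; the model only sees what you resend,
# so append every turn before making the next call.
messages = [
    {"role": "system", "content": "You analyse tabular sales data provided as CSV."},
    {"role": "user", "content": f"Here is the data:\n{data_snippet}\nWhat were the sales for 2023?"},
]

reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
messages.append({"role": "assistant", "content": reply.choices[0].message.content})

# Follow-up question: because the earlier turns are resent, "highest sales"
# is interpreted in the context of the 2023 question above.
messages.append({"role": "user", "content": "Which product had the highest sales?"})
follow_up = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(follow_up.choices[0].message.content)
```

The important detail for your second question is that the API is stateless - the model only “remembers” whatever messages you resend on each call - so your app has to store the conversation history (or a summary of it) and include it in later requests.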

1 Like

@sps thank you for the quick response.

Could you please elaborate (or share links that can help guide me)?
Basically, how do I handle large input that exceeds the model's max token size?

thanks!

OK, I have done some more research on it and now understand it better.
I have follow-up questions:

  1. From my understanding, embeddings are mainly for text data… can anyone give me the intuition behind how they can be useful for drawing inferences / insights from tabular (rows & columns) data? (It's basically time-series data of user purchase history on an e-commerce store… no text / reviews etc.)

  2. When making embeddings:
    a) should I create an embedding for each row separately (and store it)? (a rough sketch of this option follows below)
    b) or should I concatenate rows (up to the max token limit) —> create an embedding —> finally combine all the embeddings to get an embedding for the entire dataset?
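
To be concrete, option (a) as I picture it would look roughly like this (the model name and row layout are just placeholders I've seen in examples):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Option (a): one embedding per row, each row serialised as a short text string.
rows = [
    {"date": "2023-01-05", "user_id": "u42", "product": "Widget A", "amount": 1200},
    {"date": "2023-01-06", "user_id": "u17", "product": "Widget B", "amount": 950},
]

texts = [
    f"On {r['date']} user {r['user_id']} bought {r['product']} for {r['amount']}"
    for r in rows
]

# One API call can embed a batch of strings; store each vector alongside its row.
response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
row_embeddings = [item.embedding for item in response.data]

for row, vec in zip(rows, row_embeddings):
    print(row["product"], len(vec))  # each vector has the same fixed dimensionality
```

For (b) I'm less sure whether concatenating the vectors even gives a meaningful “whole-dataset” embedding, which is really part of my question.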

Thank you for any leads in that regard.

@sps

I found another post - Tabular data converted to embeddings not returning accurate results

→ it seems this user took the same approach as you suggest,
but didn't get good results…

Indeed, but embeddings can be used to identify specific records of data that are near the topic of interest in the query.

Imagine you use embeddings to identify the type of data you need to answer the question, which then triggers an aggregation using everyday software engineering. Then, armed with aggregated data, you use a chat completion to wrap it into a narrative that examines the aggregated data and interprets it.

This post kind’a leans into what I’m suggesting.
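
Very roughly, the shape of that flow might look like the sketch below - everything in it (the topic list, the column names, the aggregation) is a placeholder, it's only there to show the embed → route → aggregate → narrate sequence:

```python
# pip install openai numpy pandas
import numpy as np
import pandas as pd
from openai import OpenAI

client = OpenAI()

# 1) Embed the user's question and a handful of pre-defined "topics",
#    each of which maps to an aggregation you already know how to compute.
topics = {
    "sales by year": lambda df: df.groupby(df["date"].str[:4])["amount"].sum(),
    "sales by product": lambda df: df.groupby("product")["amount"].sum(),
}

question = "Which product had the highest sales in 2023?"
texts = [question] + list(topics)
vectors = [d.embedding for d in client.embeddings.create(
    model="text-embedding-ada-002", input=texts).data]

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 2) Route: pick the topic whose embedding is closest to the question.
scores = {name: cosine(vectors[0], vec) for name, vec in zip(topics, vectors[1:])}
best_topic = max(scores, key=scores.get)

# 3) Aggregate with everyday software engineering (pandas here), not the LLM.
df = pd.DataFrame([
    {"date": "2023-01-05", "product": "Widget A", "amount": 1200},
    {"date": "2023-02-10", "product": "Widget B", "amount": 950},
])
summary = topics[best_topic](df).to_string()

# 4) Hand only the small aggregated table to the chat model for the narrative.
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You interpret aggregated sales data."},
        {"role": "user", "content": f"{question}\n\nAggregated data:\n{summary}"},
    ],
)
print(reply.choices[0].message.content)
```

The LLM never sees the raw rows - only the question and a tiny aggregated table - so the token ceiling stops being the constraint.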

1 Like

@bill.french thank you for the reply.
I read your post, and it's starting to make more sense to me…

On the other hand, I feel it runs counter to the previous user's reply (Embeddings + Completions).

Perhaps OpenAI is not suited to analyzing time-series sorts of data (which are always huge), only aggregated data…

Today, there are ceilings on the amount of data you can throw at the LLMs. Most importantly, my skills with large data sets are limited. :wink:

And there are practical limitations at a financial level. You could recursively pummel the API with paginated data, but the cost would be prohibitive. I think the only rational approach is to aggregate first.
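
To put rough numbers on that, here's a small sketch with synthetic data, using tiktoken just to count tokens; the exact figures don't matter, only the ratio between raw and aggregated:

```python
# pip install pandas tiktoken
import pandas as pd
import tiktoken

# Synthetic stand-in for a large purchase-history table (real data would be far bigger).
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=100_000, freq="min").astype(str),
    "product": [f"Product {i % 50}" for i in range(100_000)],
    "amount": [(i * 7) % 500 + 1 for i in range(100_000)],
})

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

raw_tokens = len(enc.encode(df.to_csv(index=False)))

# Aggregate first: 100,000 rows collapse into one row per product.
summary = df.groupby("product")["amount"].sum().sort_values(ascending=False)
summary_tokens = len(enc.encode(summary.to_string()))

print(f"raw CSV: ~{raw_tokens:,} tokens")                 # far beyond any context window
print(f"aggregated summary: ~{summary_tokens:,} tokens")  # easily fits in one prompt
```

One hundred thousand rows collapse into fifty, and only those fifty ever hit the API - that's the whole trick.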

Maybe there are some experts who know the secret sauce for your time-series use case.