I’m learning to use OpenAI APIs to build an app and would like advice on what architecture I should use for my use case:
The user can provide tabular / structured data (rows & columns format)
Using the OpenAI API —> interpret insights from the data
I can already do this in https://chat.openai.com/ by copy/pasting data from a spreadsheet into the chat window and giving the necessary prompt.
But for large files - tens of thousands of rows, each file several GBs / TBs in size - it says the input text is too long.
How can I address this scenario using the OpenAI API?
Another thing - how can I ensure the system remembers the uploaded data so that the user can really “converse” with the data through successive prompts? For example, each successive response should build on previous prompts: if I ask for “sales for 2023” and the next prompt is “which product had highest sales” —> it should know I’m asking for the “product with the highest sales in 2023”.
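To show what I mean by “remembering”, here is a minimal sketch of the conversational loop I imagine, assuming the openai Python package (>= 1.0) and an API key in the environment; the model name and prompts are just placeholders:

```python
# Minimal sketch, assuming openai >= 1.0 and OPENAI_API_KEY set. The chat
# completion endpoint is stateless, so "memory" comes from resending the
# accumulated message history on every turn.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system",
            "content": "You answer questions about the user's sales data."}]

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # placeholder model name
        messages=history,
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})  # remember this turn
    return answer

ask("What were the sales for 2023?")
ask("Which product had the highest sales?")  # resolved against the 2023 context
```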
Could you please elaborate (or share links that can help guide me)?
Basically, how do I handle large input that exceeds the model’s maximum input token size?
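For reference, this is how I currently picture the problem: the data has to be split into chunks that each fit under the context limit. A sketch assuming the tiktoken package; the file name and the per-chunk budget are placeholders:

```python
# Sketch, assuming the tiktoken package. Split a CSV export into chunks that
# each stay under an illustrative per-chunk token budget; the file name and
# budget are placeholders, not real model limits.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS_PER_CHUNK = 100_000

chunks, current, current_tokens = [], [], 0
with open("purchases.csv") as f:              # hypothetical export of the table
    for line in f:
        n = len(enc.encode(line))
        if current and current_tokens + n > MAX_TOKENS_PER_CHUNK:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
if current:
    chunks.append("".join(current))
# Each chunk could be sent in its own API call, but for GB/TB-scale files the
# number of calls (and the cost) grows quickly.
```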
OK, I have done more research on it and now understand it better.
I have some follow-up questions:
From my understanding, embeddings are mainly for text data… can anyone give me the intuition behind how they can be useful for drawing inferences / insights from tabular (rows & columns) data? (It’s basically time-series data of user purchase history on an e-commerce store… no text/reviews etc.)
When making embeddings:
a) should I create an embedding for each row separately (and store it)?
b) or should I concatenate rows (up to the max token limit) —> create an embedding per chunk —> finally concatenate all the embeddings to get an embedding for the entire dataset?
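Here is roughly what I imagine option (a) would look like, if it helps frame the question. A sketch assuming the openai Python package (>= 1.0); the CSV path, column names, and model are placeholders, and a real dataset would need to be sent in batches rather than one call:

```python
# Sketch of option (a), assuming openai >= 1.0. Each row is flattened into a
# short text string and embedded separately; the vectors are kept next to the
# raw rows so individual records can later be retrieved by similarity.
# File name, column names, and model are placeholders.
import csv
from openai import OpenAI

client = OpenAI()

rows, texts = [], []
with open("purchases.csv", newline="") as f:      # hypothetical export
    for row in csv.DictReader(f):
        rows.append(row)
        # The structured row is turned into natural language before embedding.
        texts.append(f"user {row['user_id']} bought {row['product']} "
                     f"for {row['amount']} on {row['date']}")

# For a large file this would be done in batches, not a single call.
response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [item.embedding for item in response.data]

# Store (row, vector) pairs in a vector store of your choice; a list suffices here.
index = list(zip(rows, vectors))
```

With per-row vectors, a user’s question can be embedded the same way and compared by cosine similarity to pull back the most relevant rows.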
Indeed, embeddings are mainly built for text, but they can be used to identify the specific records in your data that are near the topic of the query.
Imagine you use embeddings to identify the type of data you need to answer the question, which then triggers an aggregation using everyday software engineering. Then, armed with the aggregated data, you use a chat completion to wrap it into a narrative that examines and interprets it.
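In very rough Python, assuming openai >= 1.0, numpy, and a per-row index like the one sketched above (all names, models, and cutoffs are illustrative, not a definitive implementation):

```python
# Rough sketch of that flow: embeddings pick the relevant slice, ordinary code
# aggregates it, and a chat completion narrates the small aggregated result.
# "index" is a list of (row_dict, vector) pairs built ahead of time.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=[text])
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(question: str, index) -> str:
    # 1. Retrieval: use embeddings only to find the slice of data that matters.
    q = embed(question)
    scored = sorted(index, key=lambda rv: cosine(q, np.array(rv[1])), reverse=True)
    relevant = [row for row, _ in scored[:500]]       # arbitrary top-k

    # 2. Aggregation: plain software engineering, no LLM involved.
    totals: dict[str, float] = {}
    for row in relevant:
        totals[row["product"]] = totals.get(row["product"], 0.0) + float(row["amount"])
    summary = "\n".join(f"{p}: {t:.2f}" for p, t in sorted(totals.items()))

    # 3. Narrative: the aggregated table is small enough to fit in one prompt.
    chat = client.chat.completions.create(
        model="gpt-4o-mini",                          # placeholder model name
        messages=[
            {"role": "system", "content": "Interpret the aggregated sales figures."},
            {"role": "user", "content": f"Question: {question}\n\nAggregates:\n{summary}"},
        ],
    )
    return chat.choices[0].message.content
```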
Today, there are ceilings on the amount of data you can throw at LLMs. More important, my own skills with large data sets are limited.
And there are practical limitations at a financial level. You could recursively pummel the API with paginated data, but the cost would be prohibitive. I think the only rational approach is to aggregate first.
Maybe there are some experts who know the secret sauce for your time-series use case.