What is the best way to upload datasets that exceed the token limit?

abhijithneilabraham · April 26, 2023, 6:46pm

I have been using OpenAI's gpt-3.5-turbo to build a project. I have a NoSQL dataset of 921 rows. Since the whole dataset does not fit in a single message or a series of prompts to the ChatCompletion API, I am out of ideas. In the ChatGPT browser UI, this task is much easier. Are there any open source projects that come close to solving this problem? Even GPT-3 based solutions would be nice to look at.

Thanks!

SomeUser2022 · April 26, 2023, 7:44pm

Maybe you could ask GPT to generate a preliminary SQL statement from your prompt, so the data is slimmed down to just the relevant rows before you send it.
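
A rough sketch of that two-step idea, in case it helps. It assumes the rows have been loaded into SQLite (the original data is NoSQL, so this is only one possible setup), and `generate_sql` is a hypothetical stand-in for whatever call you use to ask the model for a query. The table and column names are made up for the demo.

```python
import sqlite3
from typing import Callable


def slim_rows(conn: sqlite3.Connection, question: str,
              generate_sql: Callable[[str], str], max_rows: int = 50) -> list:
    """Run a model-generated SELECT and return at most max_rows rows as dicts."""
    sql = generate_sql(question).strip().rstrip(";")
    # Naive guard (not a real SQL sanitizer): refuse anything that is not a SELECT.
    if not sql.lower().startswith("select"):
        raise ValueError("expected a SELECT statement")
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchmany(max_rows)]


# Demo with an in-memory table and a stubbed model call (both hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EU", 10.0), (2, "US", 20.0), (3, "EU", 7.5)])


def fake_generate_sql(question: str) -> str:
    # In a real app this would send the question plus the table schema to the model.
    return "SELECT id, total FROM orders WHERE region = 'EU' ORDER BY id;"


slimmed = slim_rows(conn, "What did EU customers spend?", fake_generate_sql)
```

Only `slimmed` would then go into the actual prompt, instead of the full table.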

michael23 · July 25, 2023, 5:52pm

@abhijithneilabraham

Here are some ways to break up or analyze your data:

Data Sampling: One of the simplest methods is to take a sample of the dataset that is representative of the entire dataset. This smaller, more manageable dataset can then be analyzed using the AI. It’s important to use appropriate sampling techniques to ensure the sample is representative of the whole.
Data Chunking: Another method is to break the dataset into smaller, more manageable chunks or batches. These can then be fed into the AI one at a time. This method, often known as batching, is frequently used in machine learning applications, especially when training models.
Feature Selection: Depending on the task at hand, not all data might be necessary. By selecting only the most important features or aspects of the data (those that are most relevant to the task or question at hand), you can greatly reduce the size of the dataset. Techniques for feature selection can include methods like correlation coefficients, mutual information, or more complex methods like recursive feature elimination.
Dimensionality Reduction: Techniques like Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), or Autoencoders can be used to reduce the dimensionality of the dataset, i.e., the number of variables or features. These methods help reduce the dataset size while retaining the most important and significant information.
Distributed Processing: For particularly large datasets, you may need to use distributed processing methods. This involves using multiple machines or processors, each working on a part of the dataset simultaneously. This method can greatly increase the speed at which the data can be processed. Techniques like MapReduce or platforms like Apache Hadoop or Apache Spark are often used for this purpose.
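
As a minimal sketch of two of the ideas above (keeping only the needed fields, then chunking), something like the following could work. The 4-characters-per-token estimate is a rough assumption, not the real tokenizer, and `chunk_rows`, `estimate_tokens`, and the sample field names are hypothetical names for illustration.

```python
import json
from typing import Iterable, Iterator, List, Optional


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def chunk_rows(rows: Iterable[dict], max_tokens: int,
               keep_fields: Optional[List[str]] = None) -> Iterator[List[str]]:
    """Yield batches of JSON lines whose estimated size fits within max_tokens."""
    batch, used = [], 0
    for row in rows:
        if keep_fields is not None:
            # Field selection: drop everything the task does not need.
            row = {k: row[k] for k in keep_fields if k in row}
        line = json.dumps(row, ensure_ascii=False)
        cost = estimate_tokens(line)
        # Start a new batch when this row would overflow the budget.
        # A single row larger than the budget still gets a batch of its own.
        if batch and used + cost > max_tokens:
            yield batch
            batch, used = [], 0
        batch.append(line)
        used += cost
    if batch:
        yield batch


# Demo: 921 made-up rows, split into batches of roughly 500 tokens each.
rows = [{"id": i, "note": "x" * 40, "internal": "unused"} for i in range(921)]
chunks = list(chunk_rows(rows, max_tokens=500, keep_fields=["id", "note"]))
```

Each batch can then be sent as its own request, with the per-batch answers combined afterwards. Leave headroom in `max_tokens` for the instructions and the model's reply.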

Have you tried the gpt-4 model? Its token limit is 2x that of 3.5.

Hope this helps!
