What is the best way to upload datasets that exceed the token limit?

I have been using OpenAI's gpt-3.5-turbo to build a project. I have a NoSQL dataset of 921 rows. Since the whole dataset does not fit in a single message or a series of prompts to the ChatCompletion API, I am out of ideas. In the ChatGPT browser UI, this task is much easier. Are there any open-source projects that come close to solving this problem? Even GPT-3-based solutions would be nice to look at.


Maybe you could ask GPT to generate a preliminary SQL statement based on your prompt, to slim the data down to just the relevant rows and fields before you send it.
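
For illustration, here is a rough sketch of the "slim it down first" idea done locally in Python rather than through a generated query. The rows, field names, and the `slim_rows` helper are all made up for the example; in practice the filter would come from whatever query your store supports.

```python
import json


def slim_rows(rows, keep_fields, predicate):
    """Keep only rows matching `predicate`, and only the listed fields.

    Hypothetical helper: stands in for the query you would run
    against your own data store.
    """
    return [
        {k: row[k] for k in keep_fields if k in row}
        for row in rows
        if predicate(row)
    ]


# Made-up sample documents, standing in for the NoSQL data.
rows = [
    {"id": 1, "city": "Paris", "notes": "long free text", "sales": 120},
    {"id": 2, "city": "Berlin", "notes": "long free text", "sales": 80},
    {"id": 3, "city": "Paris", "notes": "long free text", "sales": 45},
]

# Keep only the Paris rows and drop the bulky "notes" field.
slim = slim_rows(rows, ["id", "sales"], lambda r: r["city"] == "Paris")

# Compact JSON (no spaces after separators) saves a few more characters.
payload = json.dumps(slim, separators=(",", ":"))
print(payload)
```

Only `payload` would then go into the prompt, which is usually much smaller than the full dataset.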

:wave: @abhijithneilabraham

Here are some ways to break up or analyze your data:

  1. Data Sampling: One of the simplest methods is to take a sample of the dataset that is representative of the entire dataset. This smaller, more manageable dataset can then be analyzed using the AI. It’s important to use appropriate sampling techniques to ensure the sample is representative of the whole.

  2. Data Chunking: Another method is to break the dataset into smaller, more manageable chunks or batches. These can then be fed into the AI one at a time. This method, often known as batching, is frequently used in machine learning applications, especially when training models.

  3. Feature Selection: Depending on the task at hand, not all data might be necessary. By selecting only the most important features or aspects of the data (those that are most relevant to the task or question at hand), you can greatly reduce the size of the dataset. Techniques for feature selection can include methods like correlation coefficients, mutual information, or more complex methods like recursive feature elimination.

  4. Dimensionality Reduction: Techniques like Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), or Autoencoders can be used to reduce the dimensionality of the dataset, i.e., the number of variables or features. These methods help reduce the dataset size while retaining the most important and significant information.

  5. Distributed Processing: For particularly large datasets, you may need to use distributed processing methods. This involves using multiple machines or processors, each working on a part of the dataset simultaneously. This method can greatly increase the speed at which the data can be processed. Techniques like MapReduce or platforms like Apache Hadoop or Apache Spark are often used for this purpose.
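
As a concrete example of option 2 (chunking), here is a minimal sketch that groups rows into batches that stay under a token budget. The token count is only a rough guess (about 4 characters per token is an assumption, not the model's real tokenizer), and `estimate_tokens` and `chunk_rows` are hypothetical helper names. For real budgets you would want an actual tokenizer and some headroom for the instructions and the reply.

```python
import json


def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token.
    # Swap in a real tokenizer for accurate counts.
    return max(1, len(text) // 4)


def chunk_rows(rows, max_tokens):
    """Group rows into consecutive chunks whose estimated size fits max_tokens.

    A single row larger than the budget still gets its own chunk,
    so the caller should handle oversized rows separately.
    """
    chunks, current, used = [], [], 0
    for row in rows:
        cost = estimate_tokens(json.dumps(row))
        # Start a new chunk when adding this row would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(row)
        used += cost
    if current:
        chunks.append(current)
    return chunks


# Made-up rows standing in for the real dataset.
rows = [{"id": i, "text": "x" * 40} for i in range(10)]
chunks = chunk_rows(rows, max_tokens=50)

for n, chunk in enumerate(chunks):
    print(n, [row["id"] for row in chunk])
```

Each chunk can then be sent as its own request, and the per-chunk answers combined in a final request if the task needs a result over the whole dataset.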

Have you tried the gpt-4 model? It has 2x the token limit of 3.5.

Hope this helps!