Answering lots of questions from one large chunk of text without paying tokens to input the big text chunk for each question?

I want the chatGPT API to answer lots of questions based off a fairly large chunk of text. It might be several hundred questions and the chunk of text could be 50,000+ words.

My plan is to get chatGPT to turn the text into point form to hopefully get it to 5,000 or 10,000 words long and then use the GPT4 api or the 16k 3.5 api to answer the questions.

The problem I’m anticipating is that it’s going to be very expensive if I use gpt4 to input several thousand words just to have it answer one question… because if I do that a couple hundred times, I’m inputting a ton of tokens.

Is there a way to just input the data once, and only pay for those tokens once… and then the only additional tokens it will charge me is just for the question and the response, and not the thousands of words of text that the answer will be based off of?

I’m going to be doing this a lot. The questions will always stay the same, but the chunk of data will change each time. I might do this once a day. I basically want to make some python code that says “here’s this big chunk of text in a word document… now answer these 200 questions from an excel spreadsheet based off this big chunk of text and put the responses back into excel”. And then the next day, it will be a different chunk of text, and I’ll want the answers to those same 200 questions.

I’m open to any kind of solution that will work. I would ideally like to use the gpt4 api (I have access to it) since I trust it the most not to make mistakes and give me good responses, but I’ll do whatever works best without breaking the bank.

Welcome to the forum.

There is no state for the API. You must send what you want it to know each go-around.

Thanks for the reply. That’s too bad that this isn’t possible at the moment.

Maybe my best approach would be to try to divide the questions into sections and maybe see if it can do 10 questions at a time and then I might only have to input that chunk 20 times… and it might not be all that expensive, especially if I use 3.5… which might still do a good job since I’m not asking it to do anything too complex that requires much “thinking”.
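
A minimal sketch of that batching idea, assuming the questions are already loaded into a Python list. The function names and the prompt wording are placeholders made up for illustration, not part of any library:

```python
from typing import Iterator, List


def batch_questions(questions: List[str], batch_size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size questions."""
    for start in range(0, len(questions), batch_size):
        yield questions[start:start + batch_size]


def build_prompt(document: str, batch: List[str]) -> str:
    """Combine the document and a numbered batch of questions into one prompt."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(batch, start=1))
    return (
        "Answer each question using only the document below. "
        "Number your answers to match the questions.\n\n"
        f"Document:\n{document}\n\nQuestions:\n{numbered}"
    )


questions = [f"Question {n}" for n in range(1, 201)]  # placeholder questions
batches = list(batch_questions(questions))
prompts = [build_prompt("<condensed text goes here>", b) for b in batches]
print(len(prompts))  # number of requests needed for 200 questions
```

Each prompt would then be sent as one request, so the document is paid for once per batch rather than once per question. Whether the model answers ten questions at once as reliably as one at a time is something to test.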

Have you heard of embedding? Might be useful for what you’re trying to do.

Ya, I had heard about that vector stuff, but I don’t think I ever really looked into how to use it. I think I understand the concept, and you’re right… that could be useful. I just briefly read this, so I can probably get chatGPT to teach me how to use it in the program.

So I guess the idea is to use this embedding vector thing on this big chunk of text, so then when I ask it questions about that chunk of text, it will only look at the relevant sections rather than the whole thing… and it will only charge me for the tokens for looking at the relevant data rather than the entire text, most of which isn’t relevant to that question (but will be relevant to other questions).

Does this also help to get around the token limit? So if I do this embedding vector thing on a 100,000 word MS word document or PDF or something like that, can I still ask questions based on the text from the entire document and it doesn’t go over the max token limit because each vector section that it looks up is only a small fraction of the total amount of text?

Am I understanding this correctly about what it can do?

Well, the token limits still apply, but the idea would be that you construct a prompt below the limits using the best matches from the vector search of the entire corpus…before sending to the API.
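
As a rough sketch of "constructing a prompt below the limits": assuming the chunks are already sorted best match first, you could add them until a token budget is used up. The token estimate here is a crude word-count approximation (an assumption, a real tokenizer would be more accurate), and the function names are hypothetical:

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough estimate only: assumes about 4 tokens for every 3 words."""
    return (len(text.split()) * 4 + 2) // 3


def pack_context(ranked_chunks: List[str], question: str, budget: int) -> str:
    """Add chunks in rank order, stopping at the first one that would exceed the budget."""
    used = estimate_tokens(question)
    selected = []
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    context = "\n---\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

In practice the budget should leave room for the model's answer as well, since the limit covers input and output together.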

That makes sense.

This isn’t what I’m using it for, but for example, let’s say I have a huge text document that details how someone spent their week. So it says everywhere they went. Everything they ate. Everything they bought. When they brushed their teeth. How often they drove. Who they interacted with. Stuff like that. And all that is in a big pdf file that amounts to 50,000 words.

Then I would want to have questions answered about different topics. So maybe I have 10 questions about what food they ate. e.g. How many times in the week did they eat vegetables? What fruits did they eat over the week? How often did they get take-out? etc.

Then another set of questions would be about transportation. How long did they spend driving? How much did they walk? Did they ever go on jogs? Did they use other forms of transportation like a bicycle?

So if I use the embedding vector thing maybe with the Clustering (where text strings are grouped by similarity) feature… when I ask the question about what fruits they ate… it sends that question to the API and only the information from the text document relating to food.

That would solve two problems. First is that the 50,000 word document doesn’t need to be summarized or turned into point form to get it under the 8k or 16k limit. And the second benefit would be that it’s not going to cost very much to ask each question since I’m not sending thousands of words to the API for each question… only the relevant cluster of text.

Is that right?

This is what ChatGPT-4 said:

Yes, hugzbauer’s understanding is largely correct. Here’s a breakdown:

  1. Embedding & Vector Representation: This approach essentially involves turning each section or paragraph of the large text into a vector, using some sort of embedding mechanism. This could be using methods like Word2Vec, FastText, or more modern embeddings like BERT or Universal Sentence Encoder. These vectors essentially capture the semantic meaning of the text, and allow you to find the “most relevant” portions of the document based on a given query.
  2. Relevance & Querying: Once you’ve converted your document into a series of vectors, you can then take your question, also convert it into a vector using the same embedding mechanism, and then find the most relevant parts of the document by comparing the question vector with the document vectors. Methods like cosine similarity or dot product can help you rank the sections or paragraphs by relevance.
  3. Sending to the API: After determining the most relevant parts, you would then send those parts along with your question to the GPT API. This way, you’re only sending a smaller, highly relevant chunk of the document, and not the entire thing. This reduces the number of tokens you’re sending and increases the accuracy of the answer.
  4. Token Limits: While this method can help reduce the number of tokens sent to the API for each question, you still have to abide by the token limit for any single request. If the most relevant sections of the document still exceed the token limit, you’d have to truncate or select only the top few sections.
  5. Costs: Yes, by sending fewer tokens with each request, the cost per query would be reduced.
  6. Clustering: While clustering could be beneficial in grouping similar sections together, it’s a slightly different approach. With clustering, you’re trying to group sections of the document that are similar to each other, without necessarily having a specific question in mind. It might be useful for an initial categorization of the data, but querying based on specific questions would still rely on the embedding and relevance method.

One thing to note is that the process of turning the document and questions into vectors, and finding the most relevant sections based on these vectors, would likely require some initial experimentation and fine-tuning to get accurate and relevant results. It’s also worth mentioning that while this approach can improve the efficiency and reduce the cost of querying the API, the quality of the answers would still depend on how relevant and detailed the sections sent to the API are.
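
To make the embed, compare, and rank steps concrete, here is a toy sketch. It swaps in a simple word-count vector for the embedding so that it runs without any API; `toy_embed` is a stand-in I made up, and a real embedding model would match on meaning rather than on literal shared words. The ranking by cosine similarity works the same way either way:

```python
import math
import re
from collections import Counter
from typing import List


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the question, best first."""
    q = toy_embed(question)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)[:k]


chunks = [
    "For breakfast they ate an apple and a banana.",
    "They drove to work and walked home in the evening.",
    "They ate pizza from a take-out place for dinner.",
]
best = top_matches("What did they eat for breakfast?", chunks, k=1)
print(best)  # the breakfast chunk ranks first
```

Only the returned chunks, plus the question, would be sent to the model.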

Hi! I’d say you have two challenges: the large amount of unstructured, qualitative input data, and the number of queries to be performed on that data. As a result, the base case is a scenario where the input data has to be divided into separate parts, and every part is then queried with each question until all questions are answered.
Looking at the problem like this you can decide if there are ways to structure your data in advance. Create excerpts that only contain information about dates, times and the food eaten. Then this excerpt could already be small enough to pass in at once and answer several questions about nutrition intake. If the excerpt is of high quality, maybe a cheaper model can do this task (be careful with GPT and counting).
If the answers can be short, then maybe answering several in one request becomes an option.

Building upon this idea: create a template to summarize all food related items in a common data format like JSON or XML. Monday, 9:32, 2 Bananas, fruit, cold, happy mood… then this could be handled without the use of a LLM.
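
A small sketch of what that could look like once the excerpts are in JSON. The field names and the sample records are invented for illustration; the point is that questions like "what fruits did they eat" become plain filtering:

```python
import json
from collections import Counter

# Hypothetical records in the suggested format; field names are assumptions.
records_json = """
[
  {"day": "Monday", "time": "09:32", "item": "banana", "quantity": 2, "category": "fruit"},
  {"day": "Monday", "time": "12:10", "item": "salad", "quantity": 1, "category": "vegetable"},
  {"day": "Tuesday", "time": "08:45", "item": "apple", "quantity": 1, "category": "fruit"},
  {"day": "Tuesday", "time": "19:00", "item": "pizza", "quantity": 1, "category": "take-out"}
]
"""
records = json.loads(records_json)

# Which fruits were eaten over the week?
fruits = sorted({r["item"] for r in records if r["category"] == "fruit"})

# How many entries fall into each category?
per_category = Counter(r["category"] for r in records)

print(fruits)
print(per_category["take-out"])
```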


This is the kind of application where vector embeddings might work. If you’re trying to count discrete events from your large PDF file, however, vector embeddings are less suited to that, because what you’ll get is the top N results ranked by similarity between your question and the embeddings. You can then process the retrieved chunks in the LLM, but it’s a bit difficult to know whether you’ve gotten all of the relevant chunks from your vector store.

Embeddings are better for answering whether your answer is anywhere in the large document store, not how many unique events are in it. Which is all to say, someone else recommended structuring or alternative forms of indexing. At which point the correct retrieval technology is something like a SQL database, not LLM powered whatever.
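
For the counting case, a minimal sketch using Python's built-in `sqlite3` module. The table layout and rows are made up for illustration and assume the events have already been extracted from the document:

```python
import sqlite3

# In-memory database with a hypothetical events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, kind TEXT, detail TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("Monday", "drive", "to work"),
        ("Monday", "walk", "to the park"),
        ("Tuesday", "drive", "to the store"),
        ("Wednesday", "drive", "to work"),
    ],
)

# An exact count, which top-N similarity search cannot guarantee.
drive_count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE kind = ?", ("drive",)
).fetchone()[0]
print(drive_count)
conn.close()
```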

I would also suggest the functionality of the forum search:

I have experimented with this concept A LOT.
and when i say a lot, i dont mean “quite a bit sometimes”
I mean “quite a lot every time”

But take it as you will; what I have discovered, which I will quote in the following…
"… When it comes to getting AI to understand large bodies of text, it is most efficient to do a multipass query. What do I mean by that? I’m glad that I asked… It’s about not only breaking down text into summaries, but also about summarizing in the context of the question.

Ask for a summary about a certain thing, then the other information will be lost.
Ask for a summary about everything and just about everything will be forgotten.

So the trick is to get the AI to break things down.

When creating a document processor, you should prompt the AI in the context of chunk by chunk and not so much in the grand context.

However each chunk of processing must also have an appropriate prompt with which to summarize.

Does this make sense? or does it sound like rambling?
I can never tell…

also i would like to reiterate… I have no idea what im talking about, and anything i say should be taken with the proverbial grain of salt.

(legal speak just in case)
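
One way the multipass idea above could be sketched: summarize each chunk in the context of the question, then answer from the collected notes. `ask_model` is a hypothetical callable that sends a prompt to whatever model you use and returns its text; the prompt wording and the chunk size are assumptions too:

```python
from typing import Callable, List


def split_into_chunks(text: str, words_per_chunk: int) -> List[str]:
    """Split text into consecutive pieces of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def multipass_answer(
    text: str,
    question: str,
    ask_model: Callable[[str], str],  # hypothetical: prompt in, model text out
    words_per_chunk: int = 1500,
) -> str:
    """First pass: question-specific notes per chunk. Second pass: answer from the notes."""
    notes = []
    for chunk in split_into_chunks(text, words_per_chunk):
        prompt = (
            f"Extract only what is relevant to this question: {question}\n"
            "If nothing is relevant, reply NONE.\n\n"
            f"Text:\n{chunk}"
        )
        note = ask_model(prompt).strip()
        if note and note != "NONE":
            notes.append(note)
    final_prompt = (
        f"Using only these notes, answer the question: {question}\n\n"
        "Notes:\n" + "\n".join(notes)
    )
    return ask_model(final_prompt)
```

Note that this still reads the whole document once per question, so it trades cost for thoroughness.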

I have recently been working on this same problem, here is how I solved it.

Use Flowise which is a beautiful drag and drop UI for connecting OpenAI, vector databases, documents and more to ingest your document/information into a vector database - personally I have used Pinecone & Vectara (vectara is easier to use if you are non technical but Pinecone is easier to use with flowise as there are more guides online).

there is a YouTube series by Leon van Zyl called Flowise AI tutorial which is super helpful for setting this stuff up. I highly recommend you check it out and watch the whole thing!

Once you have embedded the information into a vectorDB, you can ask questions in the flowise chatbot. The questions will then go through the same embedding model you used to ingest the documents into the vectorDB and will find the most relevant answer to your question.

The process is basically:

  1. set up OpenAI API, vector database and flowise.
  2. ingest your documents into vectorDB (Pinecone, vectara or other) using text splitter, embedding model, OpenAI API and a chain.
  3. set up a chat flow that is connected to your vector database and OpenAI API.
  4. ask questions and get answers

You should play around with the prompt to get the best, most accurate answers and may want to play around with metadata or namespaces if your data can be categorised further than just 1 big amount of text.
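
If it helps to see what the text splitter step in the process above does, here is a rough standalone sketch. This is not Flowise's own splitter, just an illustration of the idea, with made-up parameter names, that splits on words and overlaps neighbouring chunks slightly so a sentence cut at a boundary still appears whole in one of them:

```python
from typing import List


def split_text(text: str, chunk_words: int = 200, overlap_words: int = 20) -> List[str]:
    """Split text into word-based chunks where each chunk repeats the tail of the previous one."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk would then be embedded and stored in the vector database along with any metadata you want to filter on.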

Final thing, you can set up flowise on render which is a cloud hosting service so you don’t have to do it locally which then means you can allow other people to use it!

Thanks for all the replies. I think me and ChatGPT can figure out something that works with all the information all of you have provided. I don’t know how to program and I’m good with computers for a layman, but I’m not an IT guy or anything… but with the help of ChatGPT, I’ve managed to get python code and excel macros working for stuff I need to do for work, so I can probably figure this out.

I should note that most of the information I need is text based and doesn’t involve math. So the questions would be more like: What type of vehicle did you drive in (make and model) … and not really questions like, how many times did you drive.

I guess that’s important because all I need it to do is find certain types of text within the documents and I don’t really need it to add anything up.

It could be questions like: what activities did the person enjoy the most over the week? what activities did they struggle with most over the week?

And then if there’s a sentence in there about how the person struggled to put on a tie and how they enjoyed watching the sunset… it could find that information within the document and write about it.

The way these models work, you have to put in all the context for each request.
It’s a fundamental design constraint of the models; they wouldn’t know what to do if you didn’t.

And the constraint for OpenAI is “capacity to run models;” there really wouldn’t be much extra cost to them to store and re-use your previous data, but because the real cost is in the amount of tokens put into (and gotten out of) the models, it makes more sense to let you have 100% control over that.

Now, what does “breaking the bank” really mean? Note that input tokens cost half as much as output tokens. (Honestly, I think, in real terms, the difference is even more than that, because each output token requires a full inference loop, whereas input is significantly cheaper to calculate.)
So, let’s say that you load up the model, put 7.5k tokens in, get 500 tokens out, per invocation.
That costs you about 25 cents.
What questions do you have to ask, where it’s worth your time to formulate the question, but it’s not worth 25 cents to get an answer to the question?
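
The arithmetic behind that figure, as a quick sketch. The prices are assumptions chosen to be consistent with the numbers above (input at half the output price); check the current pricing page before relying on them:

```python
# Assumed prices in USD per 1,000 tokens; verify against current pricing.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request under the assumed per-1K-token prices."""
    return (
        input_tokens / 1000 * INPUT_PRICE_PER_1K
        + output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    )


per_question = request_cost(7500, 500)  # 7.5k tokens in, 500 tokens out
daily_total = 200 * per_question        # one request per question, 200 questions

print(f"per question: ${per_question:.3f}")
print(f"200 questions: ${daily_total:.2f}")
```

Under these assumptions almost all of the cost is the input side, which is why batching questions or retrieving only the relevant chunks makes such a difference.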

Also, you can reduce the size needed of the input data, by chunking it into snippets, and using an embedding index to retrieve only the chunks that match the question “best.” This may let you cut down the amount of context to whatever size you’re comfortable with, at the risk of missing something in the input document/context. (But then, the models aren’t great at paying attention to everything in a big context, anyway!)