Best way to get the API to read the whole document?

Hi there, I am trying to get the API to read a rather large research paper in PDF format. The word count fits within the API token limit and I am manually uploading the PDF and then using retrieval to access the file with the API.

I am trying to get the API to summarize the document and to give comments on it. However, it can’t seem to fit the whole scope of the file into its response: it covers only the first ~20% or the last ~20%.

From my understanding, what I am trying to do should be possible, since the file fits within the limit of 2 million tokens per file. Has anyone managed to get the API to summarize a full file?

What does your prompt look like?

Summarizing a large document is unfortunately still not a straightforward task and may require different techniques.

This is the closest I have been able to get it to doing what I need so far:

Provide an in-depth summary of the document in memory. Each paragraph should cover major points from the entire document. Avoid citation marks. Deliver the summary in plain text format. Do not include your own input, apart from what is asked of you.

One way to build it out further is to ask it to summarize the key points section by section. I occasionally also combine this with chain of thought prompting, whereby I add the specific phrase “Let’s think it through step-by-step” at the end of the prompt. In combination, I tend to get more insightful summaries, although they remain limited in depth.

However, if I am looking for more specific results, I still tend to break the document down into smaller parts and then run specific summarization requests over each part individually.
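For what it's worth, the section-by-section prompt with the step-by-step phrase could be assembled along these lines. This is only a minimal sketch: the function name `build_summary_prompt` and the section titles are made up for illustration, and the exact wording of the instructions is something to experiment with.

```python
def build_summary_prompt(section_titles, step_by_step=True):
    """Build a prompt that asks for key points section by section.

    `section_titles` is a hypothetical list of headers taken from the document.
    """
    lines = [
        "Summarize the key points of the document section by section.",
        "Cover every section listed below, in order, with one paragraph each:",
    ]
    # Number the sections so the model has an explicit checklist to work through.
    lines += [f"{i}. {title}" for i, title in enumerate(section_titles, start=1)]
    lines.append("Deliver the summary in plain text format.")
    if step_by_step:
        # Chain-of-thought cue mentioned in the post above.
        lines.append("Let's think it through step-by-step.")
    return "\n".join(lines)


prompt = build_summary_prompt(["Introduction", "Methods", "Results", "Discussion"])
print(prompt)
```

Listing the sections explicitly is a design choice meant to keep the model from spending the whole answer on the opening pages.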

Also, the word “summary” itself can sometimes affect the length and quality of your output. Try exchanging it for different terms or instructions, such as asking the model to prepare a detailed memo. It can help.

I’ll give that a go. However, have you ever managed to get it to take in an entire document and give you a summary that runs from start to end?

Well, with the above approach it at least does not focus purely on the beginning and the end and instead takes all sections into account.

By default, given the output token limits, there are limitations as to what you can achieve in a single request.

I believe that if all the major points were considered, the summary would fit within the token limit. However, it is very clear that it can’t fit the whole file into its scope. I am experimenting with prompts now, thanks to your suggestion.

The longest output I have ever achieved in a single request, independent of the input length and no matter what I tried, was around 650 words, which comes close to 1000 tokens. That seems to be consistent with the experience of others.
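As a rough sanity check on the words-to-tokens conversion, here is a small sketch. It assumes the commonly quoted rule of thumb of about 0.75 English words per token, which is an approximation and not a figure from this thread; the real count depends on the tokenizer and the text.

```python
# Assumed rule of thumb for English prose; not an exact figure.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count, words_per_token=WORDS_PER_TOKEN):
    """Return a rough token estimate for a given number of words."""
    return round(word_count / words_per_token)


print(estimate_tokens(650))
```

Under that assumption, 650 words lands somewhat below 1000 tokens, so the ceiling described above is in the same ballpark either way.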

Anyway, if it’s just about getting a coherent summary that considers all parts of the document, the above approach should help. Let me know if you run into issues and I can provide a sample prompt.

The 650 words do not bother me. The issue is how much content it’s able to absorb and then fit into those 650 words. I need it to give me a summary from A to Z, but it spends all 650 words on something like A to C.

Yeah, but you can build restrictions into your prompt, for example an instruction that it should focus on the body of the text and not the executive summary or introduction. That helps too.

You have to really play around with your prompt a bit. Often, a single change in an instruction can make a material difference. So just experiment a bit with it.

So I ended up rewriting all of the logic so that the file is handled in smaller parts, which takes about 20 minutes per file, and it still prioritized the first few pages of the file even though I was telling it to focus only on, for example, pages 50 to 60.

So the whole summary is just the first 10 pages.

Well, the model does not understand the concept of pages. You can give it cues by referring to the specific sections you want it to emphasize, using section headers as a reference.

The splitting strategy should work in general, because you are treating each part of the split file separately from a summarization point of view and then combining the individual summaries into an aggregate one (I am assuming this is the approach you followed).
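To make sure we mean the same thing, the split-then-aggregate approach could look roughly like the sketch below. It is only an illustration: `summarize_in_parts` is a made-up name, and `summarize` stands for whatever function sends a prompt to the model and returns its text reply, so no real API call is shown here.

```python
from typing import Callable, List


def summarize_in_parts(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk on its own, then merge the partial summaries.

    `summarize` is a placeholder for your own API call (prompt in, text out).
    """
    partial_summaries = []
    for index, chunk in enumerate(chunks, start=1):
        # Each request contains only one chunk, so earlier pages cannot dominate.
        prompt = (
            f"Summarize part {index} of {len(chunks)} of a longer document. "
            "Cover only the text below.\n\n" + chunk
        )
        partial_summaries.append(summarize(prompt))

    combined = "\n\n".join(
        f"Part {i}: {text}" for i, text in enumerate(partial_summaries, start=1)
    )
    final_prompt = (
        "Combine the following partial summaries into one summary that covers "
        "the document from start to end, giving each part similar weight.\n\n"
        + combined
    )
    # One extra request produces the aggregate summary.
    return summarize(final_prompt)


def fake_summarize(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return " ".join(prompt.split()[-3:])


print(summarize_in_parts(["alpha beta gamma", "delta epsilon zeta"], fake_summarize))
```

Note that this makes one request per chunk plus one for the final merge, so processing time grows with the number of chunks.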

So I am not sure why it is not working.

Well, I have one file and it has page numbers on it. They are basically the only numbers present in the file. So I ask it for a summary of pages 1-10, then 11-20, etc. Since those are the only numbers on the page, I would have assumed that would work.

Can I ask: do you actually split the file, i.e. take the content of the file and then split it into smaller pieces and feed each piece separately to the model via API call for summary?
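By splitting I mean something like the following, as opposed to asking for page ranges in the prompt. This is a minimal sketch that assumes the text has already been extracted from the PDF into a plain string; `split_into_chunks` and the word limits are placeholders to tune, not recommended values.

```python
def split_into_chunks(text, max_words=1500, overlap=100):
    """Split text into word-based chunks that overlap slightly.

    The overlap keeps a little shared context between neighbouring chunks.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny demonstration with a ten-word "document".
document_text = " ".join(f"w{i}" for i in range(10))
for piece in split_into_chunks(document_text, max_words=4, overlap=1):
    print(piece)
```

Each piece would then go to the model in its own API call, so the model never sees the first pages when it is summarizing a later part.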

At the moment, no. However, I may have to give that a go. After today’s rewrite, that will probably have to happen tomorrow.

So it worked… kind of. Now my processing time went from under 10 minutes per file to over 30 minutes. I think it’s worth it.
