What’s the limitation in your use case that prevents you from splitting this into two parallel requests that each fill out 10 pages? Or even four requests of 5 pages each?
Depending on what you’re trying to achieve, it’s possible that you’re simply not providing your input in a structured enough way, or that even though it “fits within the prompt token limit”, it’s really just “too much” (hard to say without knowing more about what you’re doing… 20 “pages” at 8pt font is very different from 20 “pages” at 12pt font!)
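For illustration, the parallel-request idea can be a few lines with the OpenAI Python SDK and a thread pool. This is only a sketch: the model name, the instruction text, and how you split the pages are all assumptions, not details from your setup.

```python
# Rough sketch only -- model name, chunking, and prompt text are made-up assumptions.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

INSTRUCTIONS = "Answer every multiple-choice question on the pages below."  # placeholder

def answer_chunk(pages_text: str) -> str:
    """Send one chunk of pages as its own independent request."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: swap in whatever model you're actually using
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": pages_text},
        ],
    )
    return resp.choices[0].message.content

# e.g. the 20 pages split into 4 chunks of 5 pages each (however you delimit pages)
chunks = ["<pages 1-5>", "<pages 6-10>", "<pages 11-15>", "<pages 16-20>"]

with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(answer_chunk, chunks))  # the 4 requests run in parallel
```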
If you really want to talk about it, you need to tell us:
how many tokens are you using in the test prompt? (if you’re not sure, see the counting sketch just after this list)
are you providing separate pre-prompting instructions (or instructions combined with the test prompt) that set clear expectations for how you want the response? (I mean natural-language instructions here, not internal restrictions set through “Responses API” settings)
how many tokens are you getting/expecting as output? (what would a “full and complete 20 page result” be in terms of output tokens, approximately?)
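If you don’t know the counts offhand, tiktoken will tell you. A minimal sketch, assuming the `gpt-4o` encoding and a hypothetical `test_prompt.txt` holding your full prompt:

```python
import tiktoken

# Encoding choice is an assumption -- match it to the model you're calling.
enc = tiktoken.encoding_for_model("gpt-4o")
prompt = open("test_prompt.txt").read()  # hypothetical file holding your full prompt
print(f"prompt is ~{len(enc.encode(prompt))} tokens")
```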
With a prompt that long, if the data itself is a sort of “single context” (i.e. the “single test”), then even if you are providing instructional parameters and highly structured data, it’s going to be difficult for the LLM to keep track of everything.
You’d probably have much better results setting up an auto-feed, auto-response system that feeds it a few pages or a few questions at a time, logs the responses, and continues (rough sketch below). Depending on the model you’re using, you’re often going to get “much better” (in this case, “much different”? i.e. the questions might be answered differently…) results if you do things in smaller chunks… plus, depending on what you’re trying to measure with this “20 page multiple choice test”, you’re going to get some variation in results depending on prompt length, single-shot vs. multi-shot, etc.
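Something like this is what I mean by an auto-feed/auto-response loop. Again just a sketch under assumptions: the pages-per-call batch size, the log format, and the model name are all placeholders.

```python
# Sketch of an auto-feed loop: send a few pages per call, log each response, continue.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

INSTRUCTIONS = "Answer the multiple-choice questions on these pages."  # placeholder

def chunk_pages(pages: list[str], size: int = 2) -> list[str]:
    """Group the test into small batches, e.g. 2 pages per request."""
    return ["\n\n".join(pages[i:i + size]) for i in range(0, len(pages), size)]

pages = ["<page 1 text>", "<page 2 text>"]  # ... and so on up to page 20

with open("responses.log", "a") as log:
    for i, chunk in enumerate(chunk_pages(pages), start=1):
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumption: use your actual model
            messages=[
                {"role": "system", "content": INSTRUCTIONS},
                {"role": "user", "content": chunk},
            ],
        )
        answer = resp.choices[0].message.content
        log.write(json.dumps({"chunk": i, "answer": answer}) + "\n")
```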
If you do take a multi-shot approach, consider dropping the previous context window on every subsequent call unless it’s critically “building on previous answers” to answer subsequent questions (compare the two payload shapes sketched after this list):
- i.e. each call is (“instructions + current pages of input”)
- instead of (“instructions + previous pages of input + previous response + current pages of input”)
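In message terms, the difference looks something like this. Pure illustration; every name and value here is a made-up placeholder:

```python
# Illustration only -- all names/values here are made-up placeholders.
INSTRUCTIONS = "Answer the questions on the pages provided."
previous_pages = "<pages 1-2>"
previous_response = "<model's answers to pages 1-2>"
current_pages = "<pages 3-4>"

# Drop-the-context version: each call sees only instructions + current pages.
stateless_messages = [
    {"role": "system", "content": INSTRUCTIONS},
    {"role": "user", "content": current_pages},
]

# Carry-everything version: the token cost grows on every call and dilutes attention.
accumulating_messages = [
    {"role": "system", "content": INSTRUCTIONS},
    {"role": "user", "content": previous_pages},
    {"role": "assistant", "content": previous_response},
    {"role": "user", "content": current_pages},
]
```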