What’s the limitation in your use case that prevents you from splitting this into two parallel requests that each fill out 10 pages? Or even four requests of 5 pages each?
Depending on what you’re trying to achieve, it’s possible that you’re simply not providing your input in a structured enough way, or that even though it “fits within the prompt token limit”, it’s really just “too much” (hard to say without knowing more about what you’re doing… 20 “pages” at 8pt font is very different from 20 “pages” at 12pt font!)
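For illustration, the parallel-request idea can be a few lines with the OpenAI Python SDK and a thread pool. This is only a sketch: the model name, the instruction text, and how you split the pages are all assumptions, not details from your setup.

```python
# Rough sketch only -- model name, chunking, and prompt text are made-up assumptions.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

INSTRUCTIONS = "Answer every multiple-choice question on the pages below."  # placeholder

def answer_chunk(pages_text: str) -> str:
    """Send one chunk of pages as its own independent request."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: swap in whatever model you're actually using
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": pages_text},
        ],
    )
    return resp.choices[0].message.content

# e.g. the 20 pages split into 4 chunks of 5 pages each (however you delimit pages)
chunks = ["<pages 1-5>", "<pages 6-10>", "<pages 11-15>", "<pages 16-20>"]

with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(answer_chunk, chunks))  # the 4 requests run in parallel
```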
If you really want to talk about it, you need to tell us:
how many tokens are you using in the test prompt? (if you’re not sure, see the counting sketch just after this list)
are you providing separate pre-prompting instructions (or instructions combined with the test prompt) that set clear expectations for how you want the response? (I mean natural-language instructions here, not internal restrictions set through “Responses API” settings)
how many tokens are you getting/expecting as output? (what would a “full and complete 20 page result” be in terms of output tokens, approximately?)
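If you don’t know the counts offhand, tiktoken will tell you. A minimal sketch, assuming the `gpt-4o` encoding and a hypothetical `test_prompt.txt` holding your full prompt:

```python
import tiktoken

# Encoding choice is an assumption -- match it to the model you're calling.
enc = tiktoken.encoding_for_model("gpt-4o")
prompt = open("test_prompt.txt").read()  # hypothetical file holding your full prompt
print(f"prompt is ~{len(enc.encode(prompt))} tokens")
```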
With a prompt that long, if the data itself is a sort of “single context” (i.e. the “single test”), then even if you are providing instructional parameters and highly structured data, it’s going to be difficult for the LLM to keep track of everything.
You’d probably have much better results setting up an auto-feed, auto-response system that feeds it a few pages or a few questions at a time, logs the responses, and continues (rough sketch below). Depending on the model you’re using, you’re often going to get “much better” (in this case, “much different”? i.e. the questions might be answered differently…) results if you do things in smaller chunks… plus, depending on what you’re trying to measure with this “20 page multiple choice test”, you’re going to get some variation in results depending on prompt length, single-shot vs. multi-shot, etc.
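Something like this is what I mean by an auto-feed/auto-response loop. Again just a sketch under assumptions: the pages-per-call batch size, the log format, and the model name are all placeholders.

```python
# Sketch of an auto-feed loop: send a few pages per call, log each response, continue.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

INSTRUCTIONS = "Answer the multiple-choice questions on these pages."  # placeholder

def chunk_pages(pages: list[str], size: int = 2) -> list[str]:
    """Group the test into small batches, e.g. 2 pages per request."""
    return ["\n\n".join(pages[i:i + size]) for i in range(0, len(pages), size)]

pages = ["<page 1 text>", "<page 2 text>"]  # ... and so on up to page 20

with open("responses.log", "a") as log:
    for i, chunk in enumerate(chunk_pages(pages), start=1):
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumption: use your actual model
            messages=[
                {"role": "system", "content": INSTRUCTIONS},
                {"role": "user", "content": chunk},
            ],
        )
        answer = resp.choices[0].message.content
        log.write(json.dumps({"chunk": i, "answer": answer}) + "\n")
```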
If you do take a multi-shot approach, consider dropping the previous context window on every subsequent call unless it’s critically “building on previous answers” to answer subsequent questions (compare the two payload shapes sketched after this list):
- i.e. each call is (“instructions + current pages of input”)
- instead of (“instructions + previous pages of input + previous response + current pages of input”)
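In message terms, the difference looks something like this. Pure illustration; every name and value here is a made-up placeholder:

```python
# Illustration only -- all names/values here are made-up placeholders.
INSTRUCTIONS = "Answer the questions on the pages provided."
previous_pages = "<pages 1-2>"
previous_response = "<model's answers to pages 1-2>"
current_pages = "<pages 3-4>"

# Drop-the-context version: each call sees only instructions + current pages.
stateless_messages = [
    {"role": "system", "content": INSTRUCTIONS},
    {"role": "user", "content": current_pages},
]

# Carry-everything version: the token cost grows on every call and dilutes attention.
accumulating_messages = [
    {"role": "system", "content": INSTRUCTIONS},
    {"role": "user", "content": previous_pages},
    {"role": "assistant", "content": previous_response},
    {"role": "user", "content": current_pages},
]
```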