How Can I Ensure an LLM Answers a Static 20-Page Multiple-Choice Assessment in Order?

Hi everyone,

I fine-tuned an LLM to process a static, simplified 20-page multiple-choice assessment (extracted from a 42-page form). My goal is for the model to answer every question in order and return the full assessment in the same structured format.

I used multiple agents (each with an injected knowledge base) to handle sections, and implemented a file-search tool as well. I tried enforcing strict output using a Pydantic JSON schema, but the approach became too token-heavy due to the length of the form and the number of multiple-choice options. Despite this, the model still occasionally omits questions or answers out of order.
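For context, the schema I was enforcing looked roughly like this (a simplified sketch; the `Answer`/`Assessment` names and the four-option `Literal` are just illustrative, the real schema spells out every question’s options):

```python
from typing import Literal
from pydantic import BaseModel, Field

class Answer(BaseModel):
    question_number: int = Field(description="1-based position of the question in the assessment")
    choice: Literal["A", "B", "C", "D"]  # assumes four options per question

class Assessment(BaseModel):
    # the model is asked to return one Answer per question, in order
    answers: list[Answer]
```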

How can I ensure the LLM reliably completes the entire assessment, maintaining both structure and order?

Any insights on enforcing strict output while keeping it efficient would be greatly appreciated.

Thanks in advance!

Robert


What’s the limitation in your use case preventing you from splitting it up into two parallel requests for it to fill out 10 pages at a time? Or even 4 requests for filling out 5 pages at a time?
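For example, something along these lines (just a sketch, assuming you can chunk the form by page; `call_llm` is a placeholder for whatever API call you’re actually making):

```python
import asyncio

async def call_llm(prompt: str) -> str:
    # placeholder -- swap in your actual chat/completions request
    raise NotImplementedError

async def answer_in_parallel(instructions: str, pages: list[str], pages_per_call: int = 5) -> list[str]:
    # split the 20 pages into independent chunks and fire the requests concurrently
    chunks = ["\n".join(pages[i:i + pages_per_call]) for i in range(0, len(pages), pages_per_call)]
    return await asyncio.gather(*(call_llm(f"{instructions}\n\n{c}") for c in chunks))
```

Since `asyncio.gather` preserves input order, you can concatenate the chunk results afterwards and the questions stay in sequence.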

Depending on what you’re trying to achieve, it’s possible that you’re simply not providing your input in a structured enough way, or that even though it “fits within the prompt token limit”, it’s really just “too much” (hard to say without knowing more about what you’re doing… 20 “pages” at 8pt font is very different from 20 “pages” at 12pt font!)

If you really want to talk about it, you need to tell us:

• how many tokens are you using in the test prompt? (a quick token-count sketch is below)
• are you providing separate pre-prompting instructions (or instructions combined with the test prompt) that set clear expectations for how you want the response? (talking about natural-language instructions here, not constraints set through Responses API settings)
• how many tokens are you getting/expecting as output? (approximately how many output tokens would a “full and complete 20-page result” be?)
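On the first point, something like tiktoken gives you a quick read (a rough sketch; pick the encoding that matches your model, and the file name is just a stand-in for however you store the prompt):

```python
import tiktoken

# cl100k_base matches many recent OpenAI chat models; swap in your model's encoding if it differs
enc = tiktoken.get_encoding("cl100k_base")

with open("assessment_prompt.txt") as f:  # hypothetical file holding the full test prompt
    prompt = f.read()

print(f"prompt tokens: {len(enc.encode(prompt))}")
```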

With a prompt that long, if the data itself is a sort of “single context” (i.e. the “single test”), then even if you’re providing instructional parameters and highly structured data, that’s going to be difficult for the LLM to keep track of.

You’d probably have much better results setting up an auto-feed/auto-response loop that feeds the model a few pages or a few questions at a time, logs each response, and continues. Depending on the model you’re using, you’re often going to get “much better” (in this case, perhaps just different? i.e. the questions might be answered differently…) results if you do things in smaller chunks… plus, depending on what you’re trying to measure with this “20 page multiple choice test”, you’re going to get some variation in results driven by prompt length, single-shot vs. multi-shot, etc.

If you do take a multi-shot approach, consider dropping the previous context window in every subsequent call unless later questions critically build on previous answers (see the sketch after the list below):

  • i.e. each call is
    • (“instructions + current pages of input”)
  • instead of
    • (“instructions + previous pages of input + previous response + current pages of input”)
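A rough sketch of that kind of loop (again, `call_llm` is a placeholder for your actual model call):

```python
def call_llm(prompt: str) -> str:
    # placeholder -- swap in your actual model call
    raise NotImplementedError

def run_assessment(instructions: str, pages: list[str], pages_per_call: int = 3) -> list[str]:
    """Feed the form a few pages at a time; each call sees only the
    instructions plus the current pages -- no earlier pages or answers."""
    responses = []
    for i in range(0, len(pages), pages_per_call):
        chunk = "\n".join(pages[i:i + pages_per_call])
        prompt = f"{instructions}\n\n{chunk}"  # previous context is dropped
        responses.append(call_llm(prompt))     # log each chunk's answers before continuing
    return responses
```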