Current LLM APIs don’t let developers create or reuse a prompt cache/prefill state without generating output tokens. This prevents “speculative UI branching” patterns, where you precompute the state for several likely user choices and then respond instantly once the user selects a path.
In my project, a time-to-first-token of roughly 0.5 to 1 second makes an LLM-powered interface feel sluggish. Generating full responses for each possible UI branch is expensive and wasteful, especially when only one path is ultimately used.
A solution would be to add an explicit prefill or cache-warm operation that computes and stores the model’s prefix state for a given prompt, with the option to cheaply fork that state. A follow-up call could then continue decoding from the cached state after the user makes a choice.
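To make the proposal concrete, here is a rough sketch of the call pattern I have in mind. Everything in it is hypothetical: `MockPrefillClient`, `prefill`, `fork`, `continue_from`, and `PrefixHandle` are names I made up, not part of any existing SDK. The "state" is just a tuple of whitespace-split tokens standing in for a real KV cache, and decoding returns placeholder tokens. The only thing it is meant to show is the shape of the API and that the shared prefix is processed once.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class PrefixHandle:
    """Hypothetical opaque handle to a cached prefix state."""
    handle_id: int
    tokens: tuple[str, ...]  # stand-in for the real KV cache contents


class MockPrefillClient:
    """In-memory mock of the proposed API; no model is involved."""

    def __init__(self) -> None:
        self._ids = count(1)
        # Counts tokens that had to be processed by prefill or fork.
        self.prefill_tokens_computed = 0

    def prefill(self, prompt: str) -> PrefixHandle:
        """Compute and store the prefix state without generating output."""
        tokens = tuple(prompt.split())
        self.prefill_tokens_computed += len(tokens)
        return PrefixHandle(next(self._ids), tokens)

    def fork(self, handle: PrefixHandle, suffix: str) -> PrefixHandle:
        """Extend a cached state with a branch-specific suffix.

        Only the suffix tokens are processed; the parent handle is unchanged.
        """
        extra = tuple(suffix.split())
        self.prefill_tokens_computed += len(extra)
        return PrefixHandle(next(self._ids), handle.tokens + extra)

    def continue_from(self, handle: PrefixHandle, max_tokens: int = 4) -> list[str]:
        """Start decoding from a cached state (placeholder output only)."""
        return [f"<tok{i}>" for i in range(max_tokens)]


# Warm the shared context once, while the user is still looking at the UI.
client = MockPrefillClient()
base = client.prefill("You are a travel assistant. The user is planning a trip.")

# Cheaply fork one state per likely UI branch.
branches = {
    label: client.fork(base, f"User chose: {label}")
    for label in ("flights", "hotels", "trains")
}

# The user clicks "hotels": decode immediately from the warmed branch.
reply = client.continue_from(branches["hotels"], max_tokens=3)
print(reply)
print("tokens processed during warm-up:", client.prefill_tokens_computed)
```

In a real implementation the handle would presumably need a TTL and some pricing model for held state, which I have left out here.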
I believe the infrastructure already exists. Exposing the prefill step through the API would let developers build more precise and dynamic UX. I’d like to hear thoughts or use cases around latency reduction or branching.