I decided to give predicted outputs a quick test run for “text proofreading”.
It seems like a good fit because the majority of the text is unchanged.
However:
- It increases costs
- The speed benefit is good but not earth-shattering
- It has an unexpectedly low hit rate
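For context, this is roughly the shape of the call being tested; a minimal sketch assuming the openai Python SDK, with the model, prompt, and draft text as placeholders rather than the exact ones from the experiment:

```python
# Minimal proofreading call with Predicted Outputs (sketch; model, prompt and
# draft are placeholders, not the exact ones used in the experiment).
from openai import OpenAI

client = OpenAI()

draft = "Text with a few typos to proofraed..."  # placeholder input

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Proofread the text and return only the corrected text."},
        {"role": "user", "content": draft},
    ],
    # Predicted Outputs: pass the original draft as the prediction, since most
    # of it should come back unchanged after proofreading.
    prediction={"type": "content", "content": draft},
)

print(response.choices[0].message.content)
print(response.usage)
```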
It increases costs
This is the clearest finding from the experiment: output tokens are predictably and consistently higher.
In the case of the text above, output tokens move from 208 → 303, as the usage breakdown shows:
"usage": {
"prompt_tokens": 677,
"completion_tokens": 303,
"total_tokens": 980,
"prompt_tokens_details": {
"cached_tokens": 0,
"audio_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0,
"audio_tokens": 0,
"accepted_prediction_tokens": 152,
"rejected_prediction_tokens": 94
}
},
It appears that you pay for rejected prediction tokens at output-token rates, so there is a brand new cost on top of your normal output tokens.
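To make the impact concrete, here is a rough back-of-the-envelope calculation based on the usage block above. It assumes rejected prediction tokens are billed at the normal output rate and that completion_tokens already includes them (303 − 94 ≈ the original 208); the price constant is a placeholder.

```python
# Rough cost impact of rejected prediction tokens, using the usage block above.
# Assumes rejected tokens are billed at the normal output rate and that
# completion_tokens already includes them; the price is a placeholder.
OUTPUT_PRICE_PER_1M_TOKENS = 10.00  # USD, placeholder; check the current pricing page

usage = {
    "completion_tokens": 303,
    "accepted_prediction_tokens": 152,
    "rejected_prediction_tokens": 94,
}

rejected = usage["rejected_prediction_tokens"]
billed_without_rejections = usage["completion_tokens"] - rejected  # ~209 tokens

extra_cost = rejected * OUTPUT_PRICE_PER_1M_TOKENS / 1_000_000
overhead = rejected / billed_without_rejections

print(f"Extra billed output tokens: {rejected}")
print(f"Extra cost for this request: ${extra_cost:.6f}")
print(f"Output-token overhead: {overhead:.0%}")  # roughly 45% more output tokens
```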
Speed benefits
Anecdotally, across three runs on GPT-4o:
- With predicted outputs: 1319 ms, 2262 ms, 1596 ms
- Without predicted outputs: 2829 ms, 6948 ms, 3706 ms
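For anyone who wants to reproduce the comparison, here is a minimal timing sketch, again assuming the openai Python SDK (model, prompt, and draft are placeholders):

```python
# Time the same proofreading request with and without a prediction (sketch).
import time

from openai import OpenAI

client = OpenAI()
draft = "Text with a few typos to proofraed..."  # placeholder input

def proofread_ms(use_prediction: bool) -> float:
    kwargs = {}
    if use_prediction:
        kwargs["prediction"] = {"type": "content", "content": draft}
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Proofread the text and return only the corrected text."},
            {"role": "user", "content": draft},
        ],
        **kwargs,
    )
    return (time.perf_counter() - start) * 1000  # wall-clock latency in ms

print(f"with prediction:    {proofread_ms(True):.0f} ms")
print(f"without prediction: {proofread_ms(False):.0f} ms")
```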
Additionally, when testing larger bodies of text, it seems to hit pathological states quite easily, where zero predicted tokens get accepted.
Furthermore, a typo near the beginning of a body of text can lead to rejections from that point onwards, which end up being counterproductive. It is very unclear when a mismatch will cause 60 tokens to be rejected versus the entire post.
It has an unexpectedly low hit rate
This feels like the biggest problem with the feature. Ninety-four tokens can be rejected for a prediction that is only four tokens off, which makes it very hard to tune the system.
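One thing that helps when trying to tune this is logging the hit rate per request. A minimal sketch using the field names from the usage block above (it takes the completion_tokens_details dict):

```python
# Fraction of predicted tokens that were actually accepted for one request.
def prediction_hit_rate(completion_tokens_details: dict) -> float:
    accepted = completion_tokens_details.get("accepted_prediction_tokens", 0)
    rejected = completion_tokens_details.get("rejected_prediction_tokens", 0)
    predicted = accepted + rejected
    return accepted / predicted if predicted else 0.0

# Using the numbers above: 152 / (152 + 94) ≈ 0.62, i.e. ~62% of predicted
# tokens were accepted, despite the prediction being only a few tokens off.
print(prediction_hit_rate({"accepted_prediction_tokens": 152,
                           "rejected_prediction_tokens": 94}))
```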
I'm wondering what other people have been experiencing, and whether proofreading is a good use case for this feature.