Introducing Predicted Outputs

nikunj · November 5, 2024, 12:17am

Dramatically decrease latency for gpt-4o and gpt-4o-mini by providing a reference string.

Speed up:

Updating a blog post in a document
Iterating on previous model responses
Rewriting code in an existing file, like with Exponent in this video, which saw a ~3X speed-up

Get started with our docs.

_j · November 5, 2024, 12:59am

TL;DR:

`if`:

YOU can predict what the AI will write for a good portion of contiguous output, and
can disable several features, including cost limitations, and
want to pay separate output price for anything you “predict” that isn’t used.

`then`:

response may come faster

It seems it is not:

complete on a partial assistant output
output the AI can reuse until there is significant departure
suggest a response

Perhaps it needs to be exactly what the AI would write up to a point - but of course with changing inputs and random sampling, that ‘exactly’ could change at the first token.

Is the only thing that one would possibly see is a speedup - with no other evidence it is working?

I think much more documentation of the mechanism, the run length of tokens it operates on, whether chunks can be provided, or a completion up to your own expected divergence should be provided, etc.

ted-at-openai · November 5, 2024, 1:41am

Yeah, it won’t change the output of the model at all. It just speeds it up if you can guess a big amount of what the model replies with. This can happen if you’re asking for edits on a big doc or code file.

_j · November 5, 2024, 5:02am

The sending of a prediction seems to have an opposite relationship to the desired operation. Lower output speeds.

Using the prediction parameter gets you slower, more expensive AI instead of no effect when there is no match.

Predictions sent for a task of writing about cute kittens
(nothing cache-able; top_p: 0.0001 not at all making a regular length):

prediction 0: (empty string)
prediction 1: “kitten kittens cute”
prediction 2: the text part of the predicted output documentation

Calls are interleaved.

For 5 trials of prediction 0 @ 2024-11-04 08:44PM:

Stat	Average	Cold	Minimum	Maximum
stream rate	Avg: 123.340	Cold: 103.2	Min: 103.2	Max: 154.4
latency (s)	Avg: 0.537	Cold: 0.5426	Min: 0.37	Max: 0.9877
total response (s)	Avg: 1.788	Cold: 2.1319	Min: 1.5359	Max: 2.1319
total rate	Avg: 86.413	Cold: 77.396	Min: 77.396	Max: 92.525
response tokens	Avg: 153.000	Cold: 165	Min: 137	Max: 169
cost tokens	Avg: 154.800	Cold: 167	Min: 138	Max: 171
prediction tokens	Avg: 0.000	Cold: 0	Min: 0	Max: 0
accepted tokens	Avg: 0.000	Cold: 0	Min: 0	Max: 0
rejected tokens	Avg: 0.800	Cold: 1	Min: 0	Max: 1

For 5 trials of prediction 1 @ 2024-11-04 08:44PM:

Stat	Average	Cold	Minimum	Maximum
stream rate	Avg: 88.180	Cold: 85.5	Min: 54.6	Max: 115.8
latency (s)	Avg: 0.801	Cold: 0.5172	Min: 0.3544	Max: 1.806
total response (s)	Avg: 2.568	Cold: 2.3072	Min: 1.658	Max: 3.4577
total rate	Avg: 62.057	Cold: 66.748	Min: 39.622	Max: 91.677
response tokens	Avg: 147.600	Cold: 154	Min: 137	Max: 154
cost tokens	Avg: 152.400	Cold: 159	Min: 141	Max: 159
prediction tokens	Avg: 4.000	Cold: 4	Min: 4	Max: 4
accepted tokens	Avg: 0.000	Cold: 0	Min: 0	Max: 0
rejected tokens	Avg: 3.800	Cold: 4	Min: 3	Max: 4

For 5 trials of prediction 2 @ 2024-11-04 08:44PM:

Stat	Average	Cold	Minimum	Maximum
stream rate	Avg: 91.560	Cold: 86.8	Min: 68.8	Max: 118.1
latency (s)	Avg: 0.509	Cold: 0.6555	Min: 0.3166	Max: 0.6555
total response (s)	Avg: 2.156	Cold: 2.2915	Min: 1.7055	Max: 2.3228
total rate	Avg: 69.247	Cold: 62.405	Min: 59.813	Max: 85.019
response tokens	Avg: 147.400	Cold: 143	Min: 138	Max: 160
cost tokens	Avg: 211.600	Cold: 216	Min: 180	Max: 233
prediction tokens	Avg: 353.000	Cold: 353	Min: 353	Max: 353
accepted tokens	Avg: 0.000	Cold: 0	Min: 0	Max: 0
rejected tokens	Avg: 63.200	Cold: 72	Min: 28	Max: 72

Token Usage Log: 5 trials of 0:

Measured	Completion	Rejected
165	167	1
142	144	1
169	171	1
152	154	1
137	138	0

Token Usage Log: 5 trials of 1:

Measured	Completion	Prediction	Rejected
154	159	4	4
148	153	4	4
137	141	4	3
147	152	4	4
152	157	4	4

Token Usage Log: 5 trials of 2:

Measured	Completion	Prediction	Rejected
143	216	353	72
138	211	353	72
160	233	353	72
145	218	353	72
151	180	353	28

The accounting is all goofy. “measured” and “prediction” are the response and the sent prediction measured by tiktoken, the rest are returned in the usage chunk. It can’t even be certain if I get billed for an empty string…

nikunj · November 5, 2024, 5:18am

@_j - the intended use case for this for tasks related to rewriting code or documents with minor changes. e.g. “refactor this code to change the variable name from x to y” or “rewrite this blogpost while only changing the name of the product from a to b”. In these cases, you pass the original draft as the prediction and then see inference speed up any time the model output and the predicted tokens match.

You shouldn’t expect this to help you with tasks where you don’t have a good sense of a long response before the model produces the response (which is what your prompt above about a story related to cute kittens is attempting to do).

_j · November 5, 2024, 5:25am

I’m pointing out that this is not passive in practice - it is impactful to performance if there are not hits, and speed is the only thing being paid for.

I’m sure there’s a break-even point you can get beyond, but that is one more prediction you have to make besides the text you have to predict yourself. gpt-4o-mini used above.

The next thing I’ll be curious about investigating is the intended use, and on shorter and longer runs of reproduction. For example, an AI adding <strong> bolding on important words of the same text passed as prediction.

vb · November 5, 2024, 5:40am

It’s quite easy to predict parts of the model’s output when it is asked to make changes to an existing piece of text that is also provided as input.
It is not about something the AI has written previously up to some point. This is not prompt caching.

In your test scenario, however, the model isn’t given any text to modify. Instead, you’re requesting an entirely new completion each time.

I suggest running your tests again, but this time using the example provided in the documentation, or something similar, to effectively leverage the model’s predictive capabilities.

merefield · November 5, 2024, 7:01am

Is this partly how Canvas works under the hood?

_j · November 5, 2024, 7:18am

Canvas? Could be if you experience faster generation speed there of existing canvas, but for you, not, because there is no function calling allowed. ChatGPT also builds long chats making context cache useful, and ends session interactions instead of tailing chats because of cache economy.

ChatGPT is encouraged to rewrite the whole canvas document to its tool instead of issuing patches or knowing line numbers (like the clever might do in their code app with chat window, line numbered original, and AI diff window, selecting sections for context and modifiability within). So if total computation is reduced, being reflected by the speed, certainly a good application.

Timeliness as a feature, on processing of predictable ins and outs, is kind of overruled by those predictable processing jobs being able to be done with dozens of calls a second (until you hit Cloudflare errors before your rate limit.) The actual app to take advantage is a head-scratcher for you to solve.

Diet · November 5, 2024, 7:30pm

I’m sorry - I’m a little slow here

I thought we’d finally be getting partial assistant output in prompts, but this is not that.

In fact, it looks like we’re billed normally for all output tokens, predicted or not, is that right?

I mean it’s cool tech, and I wonder how you guys are doing it (parallel generation? Multiple token prediction + skip ahead? Quants?)

But for this particular use case aren’t diffs/ fuzzy diffs much faster and cheaper?

anon10827405 · November 5, 2024, 7:46pm

If I understand it all correctly (on a very basic level) it’s first running a smaller model on the tokens, which in theory if it’s right it should be the exact same (like in code or large texts where a small body needs updating)

As long as the smaller model’s tokens line up with the predicted you don’t get charged.

But once the smaller model deviates the reign is passed to the actual model to now perform inference.

I was thinking (hoping) for the same. Doubt it’ll happen anytime soon considering the safety issues

I’m wondering if maybe the performance of diffs puts a strain on the quality of the output?

Diet · November 5, 2024, 9:47pm

I don’t see how or why

Although some implementations struggle with diffs, a fuzzy match (esp when it comes to line numbers) typically works

PaulBellow · November 5, 2024, 9:47pm

I’m thinking this might be useful for editing large texts. I haven’t tested yet, but I’ll report back if I do.

ETJames · November 13, 2024, 3:11pm

Does this have any bearing on the maximum input length?

PaulBellow · November 18, 2024, 1:00pm

This topic was automatically closed after 13 days. New replies are no longer allowed.

Topic		Replies	Views
When OpenAI predicted outputed input content is large, the effect is average? API gpt-4	1	142	December 16, 2024
Using predicted outputs for proofreading Feedback gpt-4o , predicted-outputs	1	242	January 22, 2025
Hypothetical Token-increase Strategy . Community gpt-4 , chatgpt	21	351	March 17, 2025
Feature Request: Token Adaptive Model API chatgpt , api	25	2173	August 8, 2023
Do 'MAX tokens' include the follow up prompts and completion in a single chat session API token	22	5444	August 25, 2023

Measured	Completion	Prediction	Rejected
143	216	353	72
138	211	353	72
160	233	353	72
145	218	353	72
151	180	353	28

Measured	Completion	Prediction	Rejected
143	216	353	72
138	211	353	72
160	233	353	72
145	218	353	72
151	180	353	28