If YOU:
- can predict what the AI will write for a good portion of contiguous output, and
- can disable several features, including cost limitations, and
- want to pay separate output price for anything you "predict" that isn't used,

then:
- the response may come faster.
Previously, I thought it was sufficient to say "I don't get it". It seems it is not. The feature could be read as any of:
- "complete on a partial assistant output"
- "output the AI can reuse until there is significant departure"
- "suggest a response"
Perhaps it needs to be exactly what the AI would write up to a point - but of course with changing inputs and random sampling, that "exactly" could change at the first token.
Is a speedup the only thing one would possibly see, with no other evidence that it is working?
I think much more documentation should be provided: the mechanism, the run length of tokens it operates on, whether chunks can be provided, whether a completion up to your own expected divergence works, etc.
Yeah, it won't change the output of the model at all. It just speeds it up if you can guess a big amount of what the model replies with. This can happen if you're asking for edits on a big doc or code file.
Sending a prediction seems to have the opposite relationship to the desired operation: lower output speeds. Using the prediction parameter gets you slower, more expensive AI instead of no effect when there is no match.
Predictions sent for a task of writing about cute kittens (nothing cache-able; top_p: 0.0001, which still does not produce a regular length):
- prediction 0: (empty string)
- prediction 1: "kitten kittens cute"
- prediction 2: the text part of the predicted output documentation
Calls are interleaved.
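A harness along these lines would collect the measurements (the prompt wording and docs file name are my guesses at what was used; the prediction parameter and the streaming usage option follow the published API):

```python
import time
from openai import OpenAI

client = OpenAI()

# The three predictions under test; prediction 2 is a stand-in for the docs text.
predictions = ["", "kitten kittens cute", open("predicted_outputs_docs.txt").read()]

def run_trial(prediction_text: str):
    """One timed, streamed call; returns (latency, total time, usage)."""
    t0 = time.time()
    first = None
    usage = None
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Write about cute kittens."}],
        top_p=0.0001,
        prediction={"type": "content", "content": prediction_text},
        stream=True,
        stream_options={"include_usage": True},  # usage arrives in the final chunk
    )
    for chunk in stream:
        if first is None and chunk.choices and chunk.choices[0].delta.content:
            first = time.time()
        if chunk.usage is not None:
            usage = chunk.usage
    return first - t0, time.time() - t0, usage

# Interleave the three predictions, five trials each.
for trial in range(5):
    for i, p in enumerate(predictions):
        latency, total, usage = run_trial(p)
        d = usage.completion_tokens_details
        print(f"prediction {i}: latency={latency:.3f}s total={total:.3f}s "
              f"completion={usage.completion_tokens} "
              f"accepted={d.accepted_prediction_tokens} "
              f"rejected={d.rejected_prediction_tokens}")
```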
For 5 trials of prediction 0 @ 2024-11-04 08:44PM:

| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 123.340 | 103.2 | 103.2 | 154.4 |
| latency (s) | 0.537 | 0.5426 | 0.37 | 0.9877 |
| total response (s) | 1.788 | 2.1319 | 1.5359 | 2.1319 |
| total rate (tokens/s) | 86.413 | 77.396 | 77.396 | 92.525 |
| response tokens | 153.000 | 165 | 137 | 169 |
| cost tokens | 154.800 | 167 | 138 | 171 |
| prediction tokens | 0.000 | 0 | 0 | 0 |
| accepted tokens | 0.000 | 0 | 0 | 0 |
| rejected tokens | 0.800 | 1 | 0 | 1 |
For 5 trials of prediction 1 @ 2024-11-04 08:44PM:

| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 88.180 | 85.5 | 54.6 | 115.8 |
| latency (s) | 0.801 | 0.5172 | 0.3544 | 1.806 |
| total response (s) | 2.568 | 2.3072 | 1.658 | 3.4577 |
| total rate (tokens/s) | 62.057 | 66.748 | 39.622 | 91.677 |
| response tokens | 147.600 | 154 | 137 | 154 |
| cost tokens | 152.400 | 159 | 141 | 159 |
| prediction tokens | 4.000 | 4 | 4 | 4 |
| accepted tokens | 0.000 | 0 | 0 | 0 |
| rejected tokens | 3.800 | 4 | 3 | 4 |
For 5 trials of prediction 2 @ 2024-11-04 08:44PM:

| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 91.560 | 86.8 | 68.8 | 118.1 |
| latency (s) | 0.509 | 0.6555 | 0.3166 | 0.6555 |
| total response (s) | 2.156 | 2.2915 | 1.7055 | 2.3228 |
| total rate (tokens/s) | 69.247 | 62.405 | 59.813 | 85.019 |
| response tokens | 147.400 | 143 | 138 | 160 |
| cost tokens | 211.600 | 216 | 180 | 233 |
| prediction tokens | 353.000 | 353 | 353 | 353 |
| accepted tokens | 0.000 | 0 | 0 | 0 |
| rejected tokens | 63.200 | 72 | 28 | 72 |
Token Usage Log: 5 trials of 0:

| Measured | Completion | Prediction | Accepted | Rejected |
|---|---|---|---|---|
| 165 | 167 | 0 | 0 | 1 |
| 142 | 144 | 0 | 0 | 1 |
| 169 | 171 | 0 | 0 | 1 |
| 152 | 154 | 0 | 0 | 1 |
| 137 | 138 | 0 | 0 | 0 |
Token Usage Log: 5 trials of 1:

| Measured | Completion | Prediction | Accepted | Rejected |
|---|---|---|---|---|
| 154 | 159 | 4 | 0 | 4 |
| 148 | 153 | 4 | 0 | 4 |
| 137 | 141 | 4 | 0 | 3 |
| 147 | 152 | 4 | 0 | 4 |
| 152 | 157 | 4 | 0 | 4 |
Token Usage Log: 5 trials of 2:

| Measured | Completion | Prediction | Accepted | Rejected |
|---|---|---|---|---|
| 143 | 216 | 353 | 0 | 72 |
| 138 | 211 | 353 | 0 | 72 |
| 160 | 233 | 353 | 0 | 72 |
| 145 | 218 | 353 | 0 | 72 |
| 151 | 180 | 353 | 0 | 28 |
The accounting is all goofy. "Measured" and "Prediction" are the response and the sent prediction as counted by tiktoken; the rest are returned in the usage chunk. I can't even be certain whether I get billed for an empty string…
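For reference, roughly how those columns get assembled (o200k_base is the tiktoken encoding for the gpt-4o family; the accepted/rejected field names come from the usage object's completion_tokens_details):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by gpt-4o / gpt-4o-mini

def log_row(response_text: str, prediction_text: str, usage) -> dict:
    """Assemble one row of the token usage log above from a finished call."""
    d = usage.completion_tokens_details
    return {
        "measured":   len(enc.encode(response_text)),    # local count of the response
        "completion": usage.completion_tokens,           # billed completion tokens
        "prediction": len(enc.encode(prediction_text)),  # local count of the sent prediction
        "accepted":   d.accepted_prediction_tokens,
        "rejected":   d.rejected_prediction_tokens,
    }
```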
@_j - the intended use case for this is tasks related to rewriting code or documents with minor changes, e.g. "refactor this code to change the variable name from x to y" or "rewrite this blogpost while only changing the name of the product from a to b". In these cases, you pass the original draft as the prediction and then see inference speed up any time the model output and the predicted tokens match.
You shouldn't expect this to help you with tasks where you don't have a good sense of a long response before the model produces the response (which is what your prompt above about a story related to cute kittens is attempting to do).
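A minimal sketch of that workflow, using the documented prediction parameter (the file name, model choice, and edit instruction are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical original source that only needs a small, predictable edit.
original = open("billing.py").read()

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Refactor this code to rename the variable `x` to `y`. "
                   "Reply with the full updated file and nothing else.\n\n" + original,
    }],
    # The unchanged draft doubles as the prediction: any run of output tokens
    # that matches it can be accepted instead of generated one by one.
    prediction={"type": "content", "content": original},
)

print(completion.choices[0].message.content)
```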
I'm pointing out that this is not passive in practice: it hurts performance when there are no hits, and speed is the only thing being paid for.
I'm sure there's a break-even point you can get beyond, but that is one more prediction you have to make, besides the text you have to predict yourself. (gpt-4o-mini was used above.)
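To put rough numbers on that break-even from the tables above: with the fully unmatched documentation text as the prediction, the average billed count was about 212 tokens against roughly 147 actual response tokens (the ~63 rejected prediction tokens land in the completion count), which is about 44% more output cost, while the total rate fell from about 86 to 69 tokens/s versus the empty prediction. Accepted tokens have to claw back both of those before a prediction is worth sending.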
The next thing I'll be curious to investigate is the intended use, on shorter and longer runs of reproduction. For example, the AI adding <strong> bolding to important words of the same text passed as the prediction.
It's quite easy to predict parts of the model's output when it is asked to make changes to an existing piece of text that is also provided as input.
It is not about something the AI has written previously up to some point. This is not prompt caching.
In your test scenario, however, the model isn't given any text to modify. Instead, you're requesting an entirely new completion each time.
I suggest running your tests again, but this time using the example provided in the documentation, or something similar, to effectively leverage the model's predictive capabilities.
Canvas? Could be, if you experience faster generation speed there when rewriting an existing canvas; but not for you, because no function calling is allowed alongside a prediction. ChatGPT also builds long chats that make the context cache useful, and ends session interactions instead of tailing chats because of cache economy.
ChatGPT is encouraged to rewrite the whole canvas document to its tool instead of issuing patches or using line numbers (as the clever might do in their own code app, with a chat window, a line-numbered original, and an AI diff window, selecting sections for context and modifiability). So if total computation is reduced, as reflected by the speed, it is certainly a good application.
Timeliness as a feature, for processing predictable inputs and outputs, is somewhat overruled by the fact that those predictable processing jobs can be done with dozens of calls a second (until you hit Cloudflare errors before your rate limit). The actual app that takes advantage of this is a head-scratcher for you to solve.
If I understand it all correctly (on a very basic level), it's first running a smaller model over the tokens, which in theory, if it's right, should produce the exact same output (as in code or large texts where only a small body needs updating).
As long as the smaller model's tokens line up with the prediction, you don't get charged.
But once the smaller model deviates, the reins are passed to the actual model to now perform inference.
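In pseudocode, the accept/reject idea being described looks roughly like this; it is only a conceptual sketch (OpenAI has not published the implementation, and whether a separate draft model is involved is speculation). Any speedup in practice would come from the full model verifying many proposed tokens in a single forward pass rather than one at a time:

```python
def sketch_predicted_decoding(model_agrees, generate_rest, predicted_tokens):
    """Toy illustration of the mechanism speculated about above.

    model_agrees(prefix, token): stand-in for the full model confirming that
    `token` is what it would have produced next after `prefix`.
    generate_rest(prefix): ordinary token-by-token decoding from `prefix`.
    """
    prefix = []
    accepted = rejected = 0
    for token in predicted_tokens:
        if model_agrees(prefix, token):
            prefix.append(token)   # fast path: the predicted token is accepted as-is
            accepted += 1
        else:
            rejected += 1          # divergence: stop trusting the prediction...
            break
    # ...and hand the reins back to the full model for the remainder.
    return prefix + generate_rest(prefix), accepted, rejected
```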
I was thinking (hoping) for the same. Doubt it'll happen anytime soon considering the safety issues.
I'm wondering if maybe the performance of diffs puts a strain on the quality of the output?