I tried the example provided on the official website, and there was indeed a good speed improvement. However, when the input content is 10 times the size of the official example, the speed does not change noticeably with or without the prediction parameter. Have any other developers had the same experience?

```python
code = """
class User {
  firstName: string = "";
  lastName: string = "";
  username: string = "";
  firstName: string = "";
  lastName: string = "";
  username: string = "";
  firstName: string = "";
  lastName: string = "";
  username: string = "";
  firstName: string = "";
  lastName: string = "";
  username: string = "";
  firstName: string = "";
  lastName: string = "";
  username: string = "";
  firstName: string = "";
  lastName: string = "";
  username: string = "";
  firstName: string = "";
  lastName: string = "";
  username: string = "";
  firstName: string = "";
  lastName: string = "";
  username: string = "";
  firstName: string = "";
  lastName: string = "";
  username: string = "";
  firstName: string = "";
  lastName: string = "";
  username: string = "";
}
"""
```
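For context, my request follows the official Predicted Outputs example (a minimal sketch using the openai Python SDK; the rewrite instruction and model shown here are the official sample's, not necessarily my exact prompt):

```python
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Replace the username property with an email property. Respond only with code, and with no markdown formatting.",
        },
        {"role": "user", "content": code},
    ],
    # Predicted Outputs: pass the expected output verbatim as the prediction
    prediction={"type": "content", "content": code},
)

print(completion.choices[0].message.content)
```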
Using predictions at all starts from a baseline speed penalty: even the most minimal prediction lowers the token production rate, and that penalty persists well into real applications.
A document that is a near-exact match, with only the most minor of changes (such as "bold some words"), does get a rate increase, but it comes with rejected-token costs well in excess of the actual alterations.
I tested many scenarios to find where the feature is useful, and while cases like the latter can produce a speedup at higher expense, I could not come up with any good scenario where it could be enabled on arbitrary or task-based user input without the net effect being slower output.
Any tool I could envision, such as a code-canvas feature that has the AI output the whole document again, would be gambling on verbatim AI output (which gpt-4o handles poorly across iterations), versus a tool implementation that would simply be faster by not requiring an entire reproduction.
My impression from the excess costs is that the technology operates on longer runs of tokens, so more modifications make the total tally of accepted plus rejected tokens grow even higher than the amount actually sent as the prediction.
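Those counts are reported in the usage details of the response; here is a minimal sketch of reading them (the recitation prompt is only an illustration):

```python
from openai import OpenAI

client = OpenAI()

text = "The quick brown fox jumps over the lazy dog."

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Repeat this exactly: {text}"}],
    prediction={"type": "content", "content": text},
)

# Rejected prediction tokens are still billed as output tokens.
details = resp.usage.completion_tokens_details
print("completion tokens:", resp.usage.completion_tokens)
print("accepted prediction tokens:", details.accepted_prediction_tokens)
print("rejected prediction tokens:", details.rejected_prediction_tokens)
```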
I did not evaluate sending a much larger prediction input, since paying 10x more for an AI generation of dubious benefit was not worth it to me. If you are doing that, you are already gambling (would you send the whole chat history on the chance of a match?). I also did not push to extreme lengths, as going beyond the response lengths the model was trained on just invites more random AI alterations to the source.
A direct evaluation of your symptom would be to benchmark a recitation task with no prediction, then with a minimal prediction, then an identity prediction (the exact expected output), and then a far larger one.
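A rough sketch of such a benchmark (assuming the openai Python SDK; the document text and model choice are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI()

document = "\n".join(f"line {i}: the quick brown fox jumps over the lazy dog" for i in range(40))
prompt = f"Repeat the following text exactly, with no changes:\n\n{document}"

# Conditions to compare: no prediction, a tiny prediction, and an identity prediction.
conditions = {
    "none": None,
    "minimal": {"type": "content", "content": "line 0:"},
    "identity": {"type": "content", "content": document},
}

for name, prediction in conditions.items():
    kwargs = {"prediction": prediction} if prediction else {}
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    elapsed = time.perf_counter() - start
    tokens = resp.usage.completion_tokens
    print(f"{name:>8}: {elapsed:.2f}s, {tokens} completion tokens, {tokens / elapsed:.1f} tok/s")
```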