Introducing Predicted Outputs

@_j - the intended use case for this for tasks related to rewriting code or documents with minor changes. e.g. “refactor this code to change the variable name from x to y” or “rewrite this blogpost while only changing the name of the product from a to b”. In these cases, you pass the original draft as the prediction and then see inference speed up any time the model output and the predicted tokens match.

You shouldn’t expect this to help you with tasks where you don’t have a good sense of a long response before the model produces the response (which is what your prompt above about a story related to cute kittens is attempting to do).

5 Likes