While generating a response, the API’s models assign some probability to emitting the stop token at each step, and that probability varies with context. Sometimes the caller knows that stopping is the wrong behavior, and that the model should keep generating until it hits max_tokens (or until one of the caller-specified stop sequences is encountered).
In theory the caller can simulate this behavior by calling the API repeatedly (if they have no custom stop sequences), but that is much more expensive, since every call re-sends the prompt along with everything generated so far. It would be much easier to ask the sampling code to reject the stop token and retry [unless the stop token has acquired probability 1.0, which seems like an unlikely corner case that may deserve a custom error].
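To make the request concrete, here is a rough sketch of what "reject the stop token and retry" could look like inside a sampler. This is not the API's actual sampling code; `sample_without_stop`, `StopTokenCertainError`, and the dict-of-probabilities input are all made up for illustration. It masks the stop token and samples from what is left, which gives the same distribution as rejecting and retrying, and it raises a custom error in the probability 1.0 corner case.

```python
import random


class StopTokenCertainError(Exception):
    """Hypothetical error: the stop token holds all of the probability mass."""


def sample_without_stop(probs, stop_id, rng=random):
    """Sample a token id from `probs` (token id -> probability), never `stop_id`.

    Assumption: `probs` is the model's next-token distribution as a plain dict.
    """
    # Drop the stop token and any zero-probability entries.
    remaining = {tok: p for tok, p in probs.items() if tok != stop_id and p > 0.0}
    if not remaining:
        # Corner case from the post: nothing but the stop token can be sampled.
        raise StopTokenCertainError("stop token has probability 1.0")
    tokens = list(remaining)
    weights = [remaining[tok] for tok in tokens]
    # random.choices uses relative weights, so no explicit renormalization is needed.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

With something like this available as an option, the caller would never see an early stop from the default stop token; generation would end only at max_tokens or at a custom stop sequence.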
The side point about custom stop sequences is a subtle corner case and perhaps not that important. If you don’t provide any stop sequences, you can get the desired result by repeatedly calling back to Completions.create with the previous generation appended to your prompt, until you hit max_tokens. But if you provide any custom stop sequences, you can’t tell the difference between stopping due to one of those (which you may want), and stopping because of the default stop token (which you want to override and keep generating).
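For reference, the workaround for the no-custom-stop-sequences case looks roughly like the loop below. It is only a sketch: `complete` is a hypothetical stand-in for whatever wraps Completions.create, assumed here to return the generated text and the number of tokens it used, and `max_calls` is an arbitrary safety limit.

```python
from typing import Callable, Tuple

# Hypothetical wrapper signature: (prompt, max_tokens) -> (generated_text, tokens_used).
CompleteFn = Callable[[str, int], Tuple[str, int]]


def generate_until_budget(
    complete: CompleteFn, prompt: str, max_tokens: int, max_calls: int = 20
) -> str:
    """Keep calling `complete` until `max_tokens` tokens have been generated."""
    generated = ""
    remaining = max_tokens
    for _ in range(max_calls):
        if remaining <= 0:
            break
        # Every call re-sends the original prompt plus all output so far.
        text, used = complete(prompt + generated, remaining)
        if used == 0:
            # The model stopped immediately; this sketch gives up rather than retry.
            break
        generated += text
        remaining -= used
    return generated
```

Note that the loop treats every early return as "the default stop token fired", which is exactly the assumption that breaks once custom stop sequences are in play.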
yeah that one’s a special token. a model could generate the characters <endoftext> but that wouldn’t stop the sequence because it isn’t the special token
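a toy illustration of that point, in case it helps. this is not the real tokenizer (real ones use subword tokens, and the id 0 here is made up); it just shows that text typed by a user or produced by the model goes through the ordinary encoding path, which can never yield the reserved stop id:

```python
END_OF_TEXT_ID = 0  # hypothetical reserved id for the special stop token
CHAR_OFFSET = 1     # ordinary characters are mapped to ids >= 1 in this toy scheme


def encode_text(text):
    """Encode plain text character by character; never produces END_OF_TEXT_ID."""
    return [ord(ch) + CHAR_OFFSET for ch in text]


def is_stop(token_id):
    """Only the reserved id counts as a stop, not any spelling of its name."""
    return token_id == END_OF_TEXT_ID


literal = encode_text("<endoftext>")
print(any(is_stop(t) for t in literal))  # the literal characters do not stop anything
print(is_stop(END_OF_TEXT_ID))           # the special token does
```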