You understood wrong. `max_tokens` is the limit on the length of the response you will get back. `max_tokens` also reserves that space in the context window exclusively for forming the response.
The context length of a model is first loaded with the input, and then the tokens that the AI generates are added after that, in the remaining space.
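Here is a minimal sketch of that budgeting math, using the `tiktoken` tokenizer to count prompt tokens. The 4096 context length and the model name are assumptions for illustration, and chat-format requests add a few tokens of per-message overhead that this ignores:

```python
import tiktoken

CONTEXT_LENGTH = 4096  # assumed context window for the model used here

def check_budget(prompt: str, max_tokens: int, model: str = "gpt-3.5-turbo") -> int:
    """Return how many context tokens remain after the prompt and the
    reserved response space, raising if the request cannot fit."""
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = len(enc.encode(prompt))
    # max_tokens is reserved for the response, so the prompt must fit
    # in whatever the reservation leaves over.
    if prompt_tokens + max_tokens > CONTEXT_LENGTH:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens; with max_tokens={max_tokens} "
            f"reserved, the total exceeds the {CONTEXT_LENGTH}-token context."
        )
    return CONTEXT_LENGTH - prompt_tokens - max_tokens

print(check_budget("Write a haiku about autumn.", max_tokens=256))
```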
A transformer language model forms language by generating, one token at a time, the next token that should logically follow, conditioned on the input plus everything it has generated so far.
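As a conceptual sketch of that loop (not any real library's API; `ToyModel` is a stand-in that just picks tokens at random where a real transformer would score its vocabulary):

```python
import random

class ToyModel:
    """Stand-in for a real transformer: picks the next token at random,
    where a real model would score its whole vocabulary conditioned on
    the sequence so far."""
    eos_token = 0

    def predict_next(self, sequence: list[int]) -> int:
        return random.choice(range(50))  # drawing 0 ends generation early

def generate(model: ToyModel, prompt_tokens: list[int], max_tokens: int) -> list[int]:
    """Autoregressive decoding: each step appends one token, and the
    grown sequence is the input to the next step. max_tokens caps how
    many steps can run."""
    sequence = list(prompt_tokens)
    generated: list[int] = []
    for _ in range(max_tokens):
        next_token = model.predict_next(sequence)
        if next_token == model.eos_token:  # the model chose to stop
            break
        sequence.append(next_token)
        generated.append(next_token)
    return generated

print(generate(ToyModel(), [11, 22, 33], max_tokens=10))
```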
(In an ideal world, there would be two parameters: a `response_limit`, which would ensure that you don't spend too much money, and a `minimum_response_area_required`, which would throw an error if you provided too much input to leave room for the expected response. However, millions of developers and lines of deployed code rely on the existing system.)
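You can approximate those two hypothetical parameters client-side today. The names below come straight from the wish above and are not real API parameters:

```python
def safe_max_tokens(prompt_tokens: int, response_limit: int,
                    minimum_response_area_required: int,
                    context_length: int = 4096) -> int:
    """Client-side approximation of the two wished-for parameters:
    error out if the input leaves too little room for the response,
    otherwise cap spending at response_limit tokens."""
    remaining = context_length - prompt_tokens
    if remaining < minimum_response_area_required:
        raise ValueError(
            f"only {remaining} tokens left for the response; "
            f"need at least {minimum_response_area_required}"
        )
    # never reserve more than the context actually has available
    return min(response_limit, remaining)
```

The returned value is what you would then pass as `max_tokens`.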