Not allowed to have all 8192 tokens

You understood wrong.

max_tokens is the limit on the length of the response you will get back.

max_tokens also reserves that space in the context window exclusively for forming the response.

The context length of a model is first loaded with the input; the tokens the AI generates are then appended after it, in the remaining space.
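The budget arithmetic can be sketched like this (the 8192-token context length is an example figure, and `check_request` is an illustrative helper, not a real API call):

```python
def check_request(context_length, prompt_len, max_tokens):
    """max_tokens reserves space exclusively for the response,
    so input plus the reservation must fit in the context window."""
    if prompt_len + max_tokens > context_length:
        raise ValueError("prompt + max_tokens exceeds the context length")
    # tokens actually available for generation
    return context_length - prompt_len

# e.g. a 7000-token prompt in an 8192-token context window
print(check_request(8192, 7000, 1000))  # 1192 tokens left for the response
```

This is why asking for the full 8192 as max_tokens fails unless the input is empty: the reservation and the prompt compete for the same window.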

A transformer language model forms language one token at a time, predicting the token that should logically appear next based on the input and the response generated so far.
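That autoregressive loop can be sketched as follows. Here `predict_next` is a stand-in for the model itself (a toy canned sequence below, not a real inference call), but the control flow is the point: each step conditions on everything so far, and max_tokens caps the number of iterations.

```python
def generate(predict_next, prompt, max_tokens, eos="<eos>"):
    """Autoregressive decoding sketch: one token per step,
    each prediction conditioned on prompt + tokens so far."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        tok = predict_next(tokens)   # the "model" sees the whole context
        if tok == eos:               # model may stop before the cap
            break
        tokens.append(tok)
    return tokens[len(prompt):]      # only the newly generated part

# toy "model": emits a fixed continuation, then end-of-sequence
canned = iter(["Hello", "world", "<eos>"])
print(generate(lambda ctx: next(canned), ["Say", "hi"], max_tokens=5))
# ['Hello', 'world']
```

Generation ends either when the model emits its stop token or when the max_tokens budget runs out, whichever comes first.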

(In an ideal world, there would be two parameters: a response_limit to ensure you don’t spend too much money, and a minimum_response_area_required to throw an error if you provided too much input to allow the expected response to form. However, millions of developers and lines of implemented code rely on the existing system.)
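To make the hypothetical two-parameter scheme concrete, here is a sketch. Both parameter names come from the paragraph above; neither exists in any real API, and `validate_request` is an invented helper:

```python
def validate_request(context_length, prompt_len,
                     response_limit, minimum_response_area_required):
    """Hypothetical validation: response_limit only caps spend,
    while minimum_response_area_required guards against over-long input."""
    room = context_length - prompt_len
    if room < minimum_response_area_required:
        raise ValueError("too much input: not enough room for the response")
    # the cost cap never reserves space, it just truncates the budget
    return min(room, response_limit)

# 6000-token prompt, cap spend at 1000 tokens, require at least 500 of room
print(validate_request(8192, 6000, 1000, 500))  # 1000
```

The key difference from today's max_tokens is that the cost cap and the space reservation are decoupled, so a small response_limit would never be rejected just because the prompt is large.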