Here is chat model data, programmatically extracted from API error messages for the models listed by the models endpoint (which included probing fine-tune models).
Unfortunately, the error reports for 3.5 models differ from those for 4.0 given the same inputs. Getting the API to report the maximum output length restriction on the newer 3.5 models requires an input + max_tokens combination that GPT-4-turbo models would simply complete instead of erroring out on.
It is just a short list of fixed data, so I hand-edited the three affected models in this list (instead of writing and publishing even more-informed code for you to bang on the API with yourself).
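For reference, a minimal sketch of the kind of probe described above (assuming the openai v1 Python SDK; the exact error text is not guaranteed, and the function name is mine):

from openai import OpenAI, BadRequestError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def probe_limits(model: str) -> str:
    """Request an impossible completion size; the resulting error
    message typically names the model's context or output limit."""
    try:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "hi"}],
            max_tokens=999_999,  # deliberately beyond any model's limit
        )
    except BadRequestError as e:
        return str(e)
    return "no error raised; limit not reported this way"

The resulting data: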
model_context = {
    "gpt-3.5-turbo": {"context": 16385, "max_out": 4096},
    "gpt-3.5-turbo-0125": {"context": 16385, "max_out": 4096},
    "gpt-3.5-turbo-0301": {"context": 4097, "max_out": 4097},
    "gpt-3.5-turbo-0613": {"context": 4097, "max_out": 4097},
    "gpt-3.5-turbo-1106": {"context": 16385, "max_out": 4096},
    "gpt-3.5-turbo-16k": {"context": 16385, "max_out": 16385},
    "gpt-3.5-turbo-16k-0613": {"context": 16385, "max_out": 16385},
    "gpt-4": {"context": 8192, "max_out": 8192},
    "gpt-4-0125-preview": {"context": 128000, "max_out": 4096},
    "gpt-4-0314": {"context": 8192, "max_out": 8192},
    "gpt-4-0613": {"context": 8192, "max_out": 8192},
    "gpt-4-1106-preview": {"context": 128000, "max_out": 4096},
    "gpt-4-1106-vision-preview": {"context": 128000, "max_out": 4096},
    "gpt-4-32k": {"context": 32768, "max_out": 32768},
    "gpt-4-32k-0314": {"context": 32768, "max_out": 32768},
    "gpt-4-32k-0613": {"context": 32768, "max_out": 32768},
    "gpt-4-turbo-preview": {"context": 128000, "max_out": 4096},
    "gpt-4-vision-preview": {"context": 128000, "max_out": 4096},
}
from which you can extract:
>>> model_context["gpt-4"]["context"]
8192
On models where max_out equals the context length, there is no artificial restriction on output, but you can’t actually set max_tokens that high, because input and output share the context window. Instead you’d budget something like 8192 - (max_tokens=2000) = a maximum of 6192 input tokens you can send, including message overhead.
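As a sketch, that budgeting can be wrapped in a small helper (the function name is mine; it just applies the table above):

def max_input_tokens(model: str, max_tokens: int) -> int:
    """Tokens left for the input (messages plus format overhead) after
    reserving max_tokens of the context window for the reply."""
    limits = model_context[model]
    return limits["context"] - min(max_tokens, limits["max_out"])

>>> max_input_tokens("gpt-4", 2000)
6192
>>> max_input_tokens("gpt-4-1106-preview", 2000)
126000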