I have been using ‘gpt-3.5-turbo’ as the model in my API calls. Should I always be using one of the specific context models ‘-4k’ or ‘-16k’, depending on my needs? What context size is used if you just set model to ‘gpt-3.5-turbo’?
And when should I use a dated model like ‘gpt-3.5-turbo-0613’?
I want to do more specific tracking of my token pricing.
These are good observations, about details where the documentation is often not exacting and an AI can’t reliably answer.
The standard context length of gpt-3.5-turbo is 4k: 4096 available tokens, which must hold both all the input you send and the response you want the AI to generate. You don’t (and can’t) specify -4k as a suffix.
The bare name gpt-3.5-turbo is an alias that currently points to the recommended snapshot, gpt-3.5-turbo-0613; when you call gpt-3.5-turbo, your reply metadata and usage accounting will show that dated model as the one actually used.
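Since you want finer-grained price tracking, note that each chat completion response already reports the resolved model name and the token usage for that call. Here is a minimal sketch, assuming the openai Python library of the 0.x era; the per-1K prices in the comment are illustrative only, so check the current pricing page rather than relying on them:

```python
import openai

openai.api_key = "sk-..."  # your API key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # alias; the reply reports the dated model it resolved to
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response["model"])  # e.g. "gpt-3.5-turbo-0613"
usage = response["usage"]
print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])

# Illustrative per-1K-token prices; verify against the pricing page.
PRICE_PER_1K = {"prompt": 0.0015, "completion": 0.002}
cost = (usage["prompt_tokens"] / 1000) * PRICE_PER_1K["prompt"] \
     + (usage["completion_tokens"] / 1000) * PRICE_PER_1K["completion"]
print(f"call cost: ${cost:.6f}")
```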
Employing the -16k model costs exactly twice as much per token for all interactions, even if a particular API call doesn’t need the extra room. So no, you wouldn’t want to use it unless your input or output actually calls for the larger context length. One strategy is to select the model dynamically in software, only for those cases where a larger input is allowed, as sketched below.
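For example, a minimal selection helper might look like this. It is a sketch only: pick_model and the reserve_for_reply default are hypothetical names, and the prompt token count would come from a tokenizer such as tiktoken (shown further below):

```python
def pick_model(prompt_tokens: int, reserve_for_reply: int = 1000) -> str:
    """Choose the cheapest gpt-3.5-turbo variant that fits the request."""
    if prompt_tokens + reserve_for_reply <= 4096:
        return "gpt-3.5-turbo"        # standard 4k context, lower price
    if prompt_tokens + reserve_for_reply <= 16384:
        return "gpt-3.5-turbo-16k"    # larger context at double the per-token price
    raise ValueError("Input too large even for the 16k context model")
```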
Token tracking and counting within your own software can be done with a library such as tiktoken, which encodes text using the model’s actual token dictionary. That enables strategies like measuring exactly how large the past turns of an old chat are, so you can decide what must be discarded, as in the sketch below.
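Here is a minimal sketch of that approach with tiktoken. The per-message overhead of roughly 3 tokens follows OpenAI’s published counting guidance for the -0613 chat models and may differ for other models; trim_history and its default budget are hypothetical names chosen for illustration:

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def message_tokens(message: dict) -> int:
    """Approximate tokens a single chat message consumes in the context."""
    tokens = 3  # per-message framing overhead (role/content markers)
    for value in message.values():
        tokens += len(encoding.encode(value))
    return tokens

def trim_history(messages: list[dict], budget: int = 3000) -> list[dict]:
    """Keep the most recent messages that fit within the token budget."""
    kept, total = [], 0
    for msg in reversed(messages):      # walk newest to oldest
        size = message_tokens(msg)
        if total + size > budget:
            break                       # everything older gets discarded
        kept.append(msg)
        total += size
    return list(reversed(kept))         # restore chronological order
```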
General token usage can also be seen on the usage page of your OpenAI account, summarized at 5-minute granularity in the daily view.
Hopefully this informs the bigger question you have in mind!