I have been querying GPT-3.5 Turbo and have been facing huge variability in response times across completion calls. It would be a great help if someone could suggest solutions (or even potential causes).
Are you using the streaming API, or just normal responses?
If streaming, are you observing these ~70s delays before the first token arrives?
If you use regular (non-streaming) responses, note that the output length significantly impacts the response time. You can limit it by setting `max_tokens` (although you might end up getting cut-off responses).
Do you see any correlation between response length and the time it took to deliver?
fyi: I don’t think query length should have a significant impact on response time.
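To check both hypotheses above (time to first token vs. total time, and total time vs. output length), it can help to instrument the stream. Here is a minimal sketch: a generic helper that times any chunk iterator, so you can feed it the streaming response from the SDK. The `measure_stream_latency` name and the fake stream are my own illustration, not part of any API.

```python
import time
from typing import Iterable, Tuple

def measure_stream_latency(chunks: Iterable) -> Tuple[float, float, int]:
    """Consume a chunk iterator and return
    (seconds_to_first_chunk, total_seconds, chunk_count)."""
    start = time.monotonic()
    first = None
    count = 0
    for _ in chunks:
        count += 1
        if first is None:
            # Record latency when the very first chunk arrives.
            first = time.monotonic() - start
    total = time.monotonic() - start
    if first is None:
        first = total  # empty stream: no chunks ever arrived
    return first, total, count
```

With the OpenAI Python SDK you would (I believe) pass the iterator returned by a `stream=True` completion call, then log `seconds_to_first_chunk` per request; if that number is stable while `total_seconds` grows with `chunk_count`, the variability is coming from output length rather than queueing.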