Poor performance when asking ChatCompletion to translate a JSON file with nearly 100 lines of small texts

Hi, we are using the ChatCompletion API and asking GPT-3.5 to translate a small JSON file. The process is taking 3 minutes to return the translations. Any ideas on how to tune this? We would like the process to take just a few seconds.


You may be able to take advantage of parallelism, i.e. splitting the task into sections and running them in parallel, if that suits your use case. If your task cannot effectively take advantage of that, there is currently no low-cost way to improve inference speed. The only other option would be a dedicated instance, and that only starts to make sense at around the 450-million-tokens-per-day usage mark.

Can you give an example of the prompt taking 3 mins?

Hi Foxabilo, thanks for the input.
Sure. I also tried it in the playground, where it likewise takes a while.

You are a translator. Please translate the JSON to Brazilian Portuguese.

I cannot post the JSON since it contains some links.

Ok, well, you should be able to achieve around 5 words per second with GPT-3.5, so 60 seconds should give you around 300 words. Note that for JSON, even a single { is an entire token and must be included.

Thanks Foxabilo, that was really insightful and helped a lot in understanding the boundaries.

Keep in mind you pay by token, not by request. You may find you get better results if, instead of sending a whole JSON blob, you make individual requests with each item from the JSON. You’ll pay a bit more for the repeated instructions, but they look pretty short already. Smaller requests also leave less of a chance for the model to go off the rails somehow.
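A minimal sketch of the per-item approach, assuming the JSON is a flat object of strings. The `translate_text` function here is a hypothetical stand-in; in a real script it would wrap a ChatCompletion call with the translation instruction as the system message.

```python
import json

def translate_text(text):
    # Hypothetical placeholder -- a real implementation would call the
    # ChatCompletion API here and return the model's output. The prefix
    # below just makes the data flow visible.
    return f"[pt-BR] {text}"

def translate_json_per_item(blob):
    """Send one request per value instead of the whole JSON blob."""
    data = json.loads(blob)
    # One model call per field; the repeated instruction cost is small,
    # and each request stays short and hard to derail.
    return {key: translate_text(value) for key, value in data.items()}

blob = '{"greeting": "Hello", "farewell": "Goodbye"}'
result = translate_json_per_item(blob)
```

The per-item structure is also what makes parallelism straightforward later, since each field becomes an independent request.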

Sure, Novaphil, thanks for your input. We tried filtering only a few fields rather than the entire JSON, but even so it took almost 2 minutes. Appreciate your advice on the costs. We will try the parallelism.

Inference speed and token output rate improve when the input provided is small.

When the AI has to consider each of 4,000 input tokens while forming its answer (and also attend to the newly formed answer itself as it continues), the computing requirements are higher and the generation rate is lower than for a similar task operating on 100 tokens in and 100 tokens out.

Here’s a start on parallel requests. Then you just have to write the splitting and batching.
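A sketch of what that start might look like, using Python's standard `concurrent.futures`. Since API calls are network-bound, threads are enough; `translate_item` is a hypothetical stand-in for the real ChatCompletion request.

```python
import concurrent.futures

def translate_item(item):
    # Stand-in for a per-item ChatCompletion call. Uppercasing just makes
    # the result observable; the real function would return a translation.
    return item.upper()

def translate_in_parallel(items, max_workers=8):
    # executor.map preserves input order even though the requests
    # themselves complete concurrently.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(translate_item, items))

texts = ["hello", "goodbye", "thanks"]
translated = translate_in_parallel(texts)
```

With real API calls you would also want retry logic and a `max_workers` value that respects your rate limits, but the splitting and batching mentioned above slot in naturally around this shape.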
