For the new OpenAI Assistants API, I am looking at an example from the example notebook from here:
On the tutorial, it says the following:
"Notice how the Thread we created is not associated with the Assistant we created earlier! Thread objects exist independently from Assistants, which may be different from what you'd expect if you've used ChatGPT (where a Thread is tied to a model/GPT)."
Does this mean that, within each Thread, we can push in the maximum number of tokens for a model?
Basically, we know that gpt-4-turbo has a maximum context window of 128K tokens. Does that mean that if I create an Assistant and split the work across 5 Threads, each Thread can handle 128K tokens? Or must the 128K limit be distributed among these 5 Threads?
The reason I ask is that we have a tabular dataset (3,000 rows) with the schema doc_id (str) and text (str). Each document (text) is very long. We want to apply the chat completions API to each row with the new GPT-4 Turbo model. Right now, we send each row to the chat completions API one at a time with @retry. If a Thread is not tied to an Assistant, then I can create multiple Threads (say 10) and run each in parallel against the same Assistant, and this would make our process faster, correct?
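For context, our current one-at-a-time loop looks roughly like this. This is a minimal sketch: `process_row` is a placeholder standing in for the real chat-completions request, and the `retry` decorator is a simplified stand-in for the library one we actually use.

```python
import time
from functools import wraps

def retry(max_attempts=3, backoff=1.0):
    """Simplified stand-in for our @retry: exponential backoff on failure."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(backoff * 2 ** attempt)
        return wrapper
    return decorator

@retry(max_attempts=3)
def process_row(doc_id: str, text: str) -> str:
    # Placeholder for the real chat-completions call on one row.
    return f"summary of {doc_id}"

rows = [("doc1", "long text..."), ("doc2", "another long text...")]
results = {doc_id: process_row(doc_id, text) for doc_id, text in rows}
```

Each row blocks on the previous one finishing, which is why the whole run is slow for 3,000 long documents.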
Good catch! I never realized that, and yeah, you only associate them upon running. So in theory, you can run your Thread using different Assistants! Cool!
As for token limits, I think OpenAI manages them for you. You can probably store 100 messages that break the limit, but under the hood OpenAI truncates (or does whatever magic it does) to maintain the token limit and context.
From the docs:
There is no limit to the number of Messages you can store in a Thread. Once the size of the Messages exceeds the context window of the model, the Thread will attempt to include as many messages as possible that fit in the context window and drop the oldest messages. Note that this truncation strategy may evolve over time.
So you are saying, basically, that I can use Threads to "spread out" the data: even though the per-request token limit is unchanged, I can use this trick on the data-engineering side to increase throughput and shorten processing time.
Right now, with a dataset as I described previously, I have to apply the chat completions function to each row iteratively, one by one. In the future, I can simply create many Threads and send mini-batches to each Thread in parallel, so that the overall processing time is reduced. Am I understanding this correctly?
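The mini-batch idea can be sketched with a thread pool. Here `process_batch` is a placeholder for whatever runs one mini-batch through its own API Thread; the batch count and row data are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    """Placeholder: run one mini-batch through its own Thread/Assistant run."""
    return [f"processed {doc_id}" for doc_id, _text in batch]

rows = [(f"doc{i}", f"text {i}") for i in range(12)]

# Split the rows into mini-batches, one per API Thread.
n_batches = 4
batches = [rows[i::n_batches] for i in range(n_batches)]

# Run the batches concurrently; each worker would own one API Thread.
with ThreadPoolExecutor(max_workers=n_batches) as pool:
    results = [r for batch_result in pool.map(process_batch, batches)
               for r in batch_result]

print(len(results))  # all 12 rows processed
```

One caveat: this parallelizes your side of the pipeline, but your account's rate limits (requests and tokens per minute) still apply across all Threads, so the speedup is bounded by those limits rather than by the per-Thread context window.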