Can one assistant run concurrently on multiple different threads at the same time?

I have an app where there’s one “master” assistant and a dedicated thread for each user of the app. Obviously many users may be using the app at the same time, so the assistant needs to be able to run on each user’s thread simultaneously so there are no delays.

Is this currently possible // is this how it currently works? I.e., if I invoke an assistant on different threads at the same time, do those executions happen at the same time (meaning parallelism occurs)? Or is this not the case?

If it’s not the case, and each thread invocation can only occur one at a time, then that’s bad, and I’m guessing I’ll have to assign one assistant per user to achieve parallelism?

Thanks.

2 Likes

Heh heh, now this is a fun topic.

Opening Pandora’s box of parallelism is going to end up being more complicated than you think if that’s your goal.

The API calls themselves can be made asynchronously (the Python SDK ships an async client built on asyncio), so API calls can be issued concurrently. That wouldn’t be the danger here.

Honestly, there’s no better way to know if this works for your use case until you try it out for yourself. It’s at that point you’ll be able to tell whether or not the API will allow what you’re doing (my guess is it should).

The real danger is that true, genuine multithreading / parallelism doesn’t necessarily work the way you think if you’re building this in Python. In fact, “real” parallelism isn’t natively possible at all because of Python’s Global Interpreter Lock (the GIL). It’s possible in other languages, but the vast majority of people are building in Python, hence the assumption here.

I can already see people debating me on this, because Python can achieve equivalent results for use cases with high I/O throughput (like OAI’s API), but calling that parallelism requires an “um actually” that I can’t help but clarify lol.

That being said, answering your question requires some context about how you expect the assistant to be called and from where. It also may not be necessary to get into the nitty gritty of it, depending on what you’re expecting to build. For most people, it is not going to matter that you start two functions within ms of each other as coroutines vs simultaneously on different threads. What is going to matter is that an awaited call doesn’t block the execution of the rest of the code while it waits for the request to finish.
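To make that concrete, here’s a minimal sketch of the coroutine pattern. `fake_run` is a stand-in I made up, not an OpenAI SDK call; a real `await` on the async client’s run-creation request would yield to the event loop the same way:

```python
import asyncio
import time

# Stand-in for one API request per user thread. The 1-second sleep
# simulates the network wait; while one coroutine awaits, the others run.
async def fake_run(thread_id: str) -> str:
    await asyncio.sleep(1)
    return f"run complete for {thread_id}"

async def main() -> list[str]:
    start = time.monotonic()
    # Kick off one "run" per user chat thread, all overlapping.
    results = await asyncio.gather(*(fake_run(f"thread_{i}") for i in range(5)))
    elapsed = time.monotonic() - start
    # Five 1-second waits overlap, so this prints ~1s, not ~5s.
    print(f"{len(results)} runs in {elapsed:.1f}s")
    return results

if __name__ == "__main__":
    asyncio.run(main())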

If your concern is more about handling user requests concurrently, that’s a “big picture” problem, and is where client/server relationships come into play. This is what people mean when they talk about things like “load balancing.” You would be in charge of accepting async messages in your code and, well, handling them and directing where those requests should go in the rest of your code lol.
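As a toy illustration of that “accept and direct” shape (names are mine, not any framework’s): incoming user requests land on a queue and a small pool of workers pulls from it. Real load balancing happens across servers, but in-process it looks like this:

```python
import asyncio

# Toy dispatcher: requests go on a queue; workers pull and handle them.
async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    while True:
        user_id, prompt = await queue.get()
        # Here you would look up that user's chat thread and start a run.
        results.append((name, user_id, prompt.upper()))
        queue.task_done()

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    workers = [asyncio.create_task(worker(f"w{i}", queue, results)) for i in range(3)]
    for user_id in range(6):
        queue.put_nowait((user_id, f"request from user {user_id}"))
    await queue.join()  # block until every queued request is processed
    for w in workers:
        w.cancel()
    return results

if __name__ == "__main__":
    print(f"handled {len(asyncio.run(main()))} requests")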

Finally, solving these problems, asking these questions, and executing solutions is going to be 10000x easier if you just assign one assistant per user. It’s not going to suddenly make multithreading a thing, but it will make it super easy to direct things around in your program :wink:. This is partly why I recommended it in a previous topic of yours, I believe. It lets you focus on the big-picture problems of your future architecture instead of getting (chat) threads to match up right.

*NOTE: threads in parallelism contexts refers to CPU threads, not chat threads.

2 Likes

A “thread” is an extremely stupid name for OpenAI to have used. It is a chat conversation, along with the input placed there. It has nothing to do with nor does it imply some sort of threaded processing.

“Assistants” itself is also poorly named. Not only is it a role message already used in AI, but then you have both the whole endpoint concept and the instruction entity container both referred to as “assistants”.

So when we talk about creation of an assistant, you are just placing the instructions, and turning on other tools it might use. There’s not a CPU dedicated to just your assistant that can only service one user at a time.

You can run a whole product of thousands of users with one Assistant. Doing so is far wiser when you consider that some costs are billed by the assistant. Picture ChatGPT: there is one ChatGPT Plus GPT-4 that has the tools of DALL-E, Python, and web browser attached to it. Millions can use it, each with their own chats and their own chat history they can return to.

In the same way, it is the threads that are the individual user chats. They can have personal files attached to them, have additional_instructions supplied when a run is executed, can be shown as previous conversations, can even be switched to a different assistant for the next input, etc. You need to maintain rigorously which user they belong to, perhaps even using the messages metadata to store a customer ID, to ensure they are never replayed to the wrong user.
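A minimal sketch of that bookkeeping, with stand-in thread IDs (in a real app the ID would come from the API’s thread-creation call and the mapping would live in a database):

```python
# In-memory registry mapping app users to their chat threads.
class ThreadRegistry:
    def __init__(self) -> None:
        self._by_user: dict[str, str] = {}

    def thread_for(self, user_id: str) -> str:
        # Create-on-first-use: one persistent thread per user.
        if user_id not in self._by_user:
            self._by_user[user_id] = f"thread_{user_id}"  # stand-in for an API call
        return self._by_user[user_id]

    def owner_of(self, thread_id: str) -> str | None:
        # The reverse check: never replay a thread to the wrong user.
        for user, tid in self._by_user.items():
            if tid == thread_id:
                return user
        return None

registry = ThreadRegistry()
t = registry.thread_for("alice")
assert registry.owner_of(t) == "alice"
assert registry.thread_for("alice") == t  # same user always gets the same thread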

The only limitation (well, there are tons of limitations and drawbacks…) is that you cannot interact with the Assistants API at more than 60 requests per minute. That makes it just that much more useless for multiple customers whose run generations all need constant polling to see if they are done. Besides that, it is unsuitable for facing customers who can say “do it 50 times” and hit you with dollars of billing from one input.
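That polling budget has to be shared across every user’s run. A sketch of one way to pace it (the throttle logic is mine; `check_run` is a stand-in for the real run-status retrieval call):

```python
import asyncio
import time

# Shared pacer: at most one request per (60 / rpm) seconds across all pollers.
class Throttle:
    def __init__(self, rpm: int) -> None:
        self.interval = 60.0 / rpm
        self._lock = asyncio.Lock()
        self._last = 0.0

    async def wait(self) -> None:
        async with self._lock:
            gap = self.interval - (time.monotonic() - self._last)
            if gap > 0:
                await asyncio.sleep(gap)
            self._last = time.monotonic()

async def check_run(run_id: str) -> str:
    # Pretend every run finishes on its second poll.
    check_run.calls[run_id] = check_run.calls.get(run_id, 0) + 1
    return "completed" if check_run.calls[run_id] >= 2 else "in_progress"
check_run.calls = {}

async def poll_until_done(throttle: Throttle, run_id: str) -> str:
    while True:
        await throttle.wait()          # every poll, for every user, pays the toll
        if await check_run(run_id) == "completed":
            return run_id

async def main() -> list[str]:
    # rpm=6000 just so the demo finishes instantly; use 60 against the real API.
    throttle = Throttle(rpm=6000)
    return await asyncio.gather(*(poll_until_done(throttle, f"run_{i}") for i in range(4)))

if __name__ == "__main__":
    print(asyncio.run(main()))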

2 Likes

The use case can be presented differently, setting aside implementation details and terminology.

Assuming a master assistant that is conversing with tens of users through threads, is this master assistant “aware” of all the conversations occurring in parallel?

Concretely, let’s assume the master assistant is an expert on crypto coins, and there are 15 users on 15 threads: conversing about BTC in one thread, LTC+ETH in another, SOL in a third…

Now a 16th user joins, asking: what are the different users currently interested in? How would the master assistant answer that?

I understand that thinking inside-the-box may make it sound impossible. But it is not.