How does one scale a chat application for production? Is async necessary?

I’ve been learning to program for 5 months and have built a chat web app with OpenAI’s Python API. In my most recent iteration I used FastAPI and enabled asynchronous requests via the recently added acreate method. I’m trying to understand how I would scale an application like this for production to a capacity of potentially thousands of users.

I know that I could scale an app by creating multiple instances at deployment, but I wonder if this is the most efficient way, because it seems like it could get expensive quickly; perhaps there’s a way to maximize the capacity of each instance. I just want to make sure the app itself is not the bottleneck and that I can scale efficiently.

  • Does enabling async allow more concurrent requests within the same instance? I assume it does by definition, but is it essential for production?
  • How would I test my app’s concurrency limits without racking up huge API costs?
  • How many concurrent requests can I expect one instance of my app to handle without bottlenecking and queuing requests?
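On the cost question, one cheap way to probe an instance’s concurrency limits is to load-test the event loop itself with the OpenAI call replaced by a do-nothing stub. A minimal sketch (the stub and its 0.5 s delay are assumptions, not part of my app):

```python
import asyncio
import time

async def fake_completion(delay: float = 0.5) -> str:
    # Stand-in for the OpenAI call: it only waits, so it costs nothing.
    await asyncio.sleep(delay)
    return "stubbed response"

async def load_test(n_requests: int):
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_completion() for _ in range(n_requests)))
    elapsed = time.perf_counter() - start
    return len(results), elapsed

# With async, 200 concurrent calls overlap their waits on one event loop,
# so the whole batch finishes in roughly one 0.5 s delay, not 100 s.
count, elapsed = asyncio.run(load_test(200))
```

Raising `n_requests` until `elapsed` stops tracking a single delay period gives a rough sense of where the instance itself starts queuing, independent of API pricing.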

This is my async generator function:

# `Message` is assumed to be a Pydantic model with `role` and `content` fields.
import openai
from typing import List
from openai.error import RateLimitError

async def generate(messages: List[Message], model_type: str):
    try:
        response = await openai.ChatCompletion.acreate(
            model=model_type,
            messages=[message.dict() for message in messages],
            stream=True,  # stream chunks back as they arrive
        )
        async for chunk in response:
            content = chunk['choices'][0]['delta'].get('content', '')
            if content:
                yield content
    except RateLimitError as e:
        yield str(e)
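For anyone wanting to exercise a generator like this without spending API credits, the streamed response can be faked with an async iterator that yields chunk dicts shaped like the API’s delta events (the stub below is an assumption about that shape, not the real client):

```python
import asyncio

async def fake_stream():
    # Stand-in for the streamed OpenAI response: yields delta-style chunks.
    for word in ["Hello", " ", "world"]:
        yield {"choices": [{"delta": {"content": word}}]}
        await asyncio.sleep(0)

async def generate_stub():
    # Mirrors the body of the generator above, but over the fake stream.
    async for chunk in fake_stream():
        content = chunk["choices"][0]["delta"].get("content", "")
        if content:
            yield content

async def main() -> str:
    # Consume the async generator the same way FastAPI would.
    return "".join([c async for c in generate_stub()])

result = asyncio.run(main())
```

This makes the chunk-parsing logic unit-testable offline; only the final integration run needs to hit the real API.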

I agree with everything ruckus said.

Just want to add a tiny note:
When an asynchronous function is called, it may return a special type of object called a “promise” (in JavaScript) or a “future”/“coroutine” (in Python). A promise object represents the eventual completion or failure of an asynchronous operation. It serves as a placeholder for the result that will be available at some point in the future. You can think of it as a container that will hold the value or error produced when the asynchronous operation completes. If you see the promise object itself in your response, it’s because your function hasn’t been awaited properly :smiley:
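In Python terms, a tiny illustration of the difference between the coroutine object and its awaited result (the `answer` function is just a made-up example):

```python
import asyncio

async def answer() -> int:
    return 42

async def main():
    pending = answer()      # no await: this is a coroutine object, not 42
    is_int_yet = isinstance(pending, int)
    value = await pending   # awaiting it produces the actual result
    return is_int_yet, value

is_int_yet, value = asyncio.run(main())
```

If `is_int_yet`-style checks surprise you in your own code, a missing `await` is usually the culprit.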

The use of async can lead to race conditions when the order of operations is crucial but not strictly enforced, causing unexpected behavior due to the concurrent execution of these operations. This typically arises when multiple tasks access or modify shared data without synchronization, leading to conflicts and inconsistencies in the final outcome.
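A minimal sketch of such a race with a shared counter, plus the `asyncio.Lock` fix (the counter and task counts are invented for illustration):

```python
import asyncio

counter = 0

async def unsafe_increment():
    global counter
    current = counter
    await asyncio.sleep(0)    # suspension point: other tasks run here
    counter = current + 1     # writes back a stale value

async def safe_increment(lock: asyncio.Lock):
    global counter
    async with lock:          # lock makes the read-modify-write atomic
        current = counter
        await asyncio.sleep(0)
        counter = current + 1

async def main():
    global counter
    counter = 0
    await asyncio.gather(*(unsafe_increment() for _ in range(100)))
    unsafe_total = counter    # lost updates: far fewer than 100
    counter = 0
    lock = asyncio.Lock()
    await asyncio.gather(*(safe_increment(lock) for _ in range(100)))
    safe_total = counter      # all 100 increments survive
    return unsafe_total, safe_total

unsafe_total, safe_total = asyncio.run(main())
```

Every `await` is a point where another task may interleave, which is why shared mutable state needs a lock (or a redesign) even in single-threaded asyncio code.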


Can I know the datatype of the individual message you are using? I am getting this error:

InvalidRequestError: [{‘role’: ‘system’, ‘content’: ‘You are a helpful assistant’}, {‘role’: ‘user’, ‘content’: ‘hi’}] is not of type ‘object’ - ‘messages.0’