I’ve been learning to program for 5 months and have built a chat web app with OpenAI’s Python API. I used FastAPI and, in my most recent iteration, enabled asynchronous requests via the recently added async acreate method. I’m trying to understand how I would scale an application like this for production, with capacity for potentially thousands of users.
I know I could scale the app by deploying multiple instances, but I wonder whether that’s the most efficient approach, since it could get expensive quickly; perhaps there’s a way to maximize the capacity of each instance first. I just want to make sure the app itself is not the bottleneck and that I can scale efficiently.
- Does enabling async allow more concurrent requests within the same instance? I assume it does by definition, but is it essential for production?
- How would I test my app’s concurrency limits without racking up huge API costs?
- How many concurrent requests can I expect one instance of my app to handle before it bottlenecks and starts queuing requests?
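For the cost question, my current idea is to swap the API call for a fake async stream and load-test against that. This is only a sketch; `fake_acreate`, the token list, and the 0.05 s delay are all stand-ins I made up, not real OpenAI functions:

```python
import asyncio
import time

# Hypothetical stand-in for openai.ChatCompletion.acreate: it mimics the
# call shape (an awaitable returning an async iterator of chunks) but
# yields canned tokens with simulated latency, at zero API cost.
async def fake_acreate(model, messages, stream=True):
    async def _stream():
        for token in ["Hello", ", ", "world"]:
            await asyncio.sleep(0.05)  # pretend network delay per chunk
            yield {"choices": [{"delta": {"content": token}}]}
    return _stream()

async def handle_request() -> str:
    # Consume the fake stream the same way the real generator would
    response = await fake_acreate(model="gpt-3.5-turbo", messages=[], stream=True)
    parts = []
    async for chunk in response:
        content = chunk["choices"][0]["delta"].get("content", "")
        if content:
            parts.append(content)
    return "".join(parts)

async def load_test(n_concurrent: int):
    # Fire n_concurrent requests at once and time the whole batch
    start = time.perf_counter()
    results = await asyncio.gather(*(handle_request() for _ in range(n_concurrent)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(load_test(100))
print(f"100 concurrent requests in {elapsed:.2f}s")
```

If async concurrency works as I expect, 100 simulated requests should finish in roughly the time of one (~0.15 s of simulated streaming), not 100x that. Is this a reasonable way to probe the app’s limits?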
This is my async generator function:
from typing import List

import openai
from openai.error import RateLimitError  # error class in the pre-1.0 openai package

# `Message` is my Pydantic model (role/content fields), defined elsewhere

async def generate(messages: List[Message], model_type: str):
    try:
        # Await the async, streaming chat completion
        response = await openai.ChatCompletion.acreate(
            model=model_type,
            messages=[message.dict() for message in messages],
            stream=True
        )
        # Yield tokens to the client as they arrive
        async for chunk in response:
            content = chunk['choices'][0]['delta'].get('content', '')
            if content:
                yield content
    except RateLimitError as e:
        # Pass the rate-limit message through to the client instead of crashing
        yield f"{str(e)}"