Significant degradation in API response time when running in GCP Cloud Run

We are using OpenAI’s chat API (via the langchain Python module) to implement a document Q&A service. While testing the service locally on my developer machine (MacBook Pro), the API response times are much lower (2 to 5 seconds). However, once deployed as a GCP Cloud Run service, we observe significant degradation in API response time (30 to 120 seconds).

Here are some details regarding the implementation:

  1. Using the ChatOpenAI class from the langchain module
  2. Tried both the ‘gpt-3.5-turbo-0613’ and ‘gpt-3.5-turbo-16k-0613’ models; haven’t observed much difference in performance
  3. We have to ask multiple questions to generate the expected results. Each question is handled as a separate Chat API request, and when the queries are parallelized the individual response times get worse (see the sketch after this list)
  4. Using Pinecone to store the vectorized document data
  5. The service runs on a single-core instance in the cloud, since it only uses multi-threading (no multi-processing support has been added to the service)
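For reference, here is a minimal sketch of how the parallel questions are issued, assuming a 2023-era langchain release; the question list and the retrieved context are hypothetical stand-ins (the real service builds the prompt from a Pinecone retriever):

```python
from concurrent.futures import ThreadPoolExecutor
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

# One ChatOpenAI client shared by all worker threads
llm = ChatOpenAI(model_name="gpt-3.5-turbo-0613", temperature=0)

def ask(question: str, context: str) -> str:
    """Send a single document Q&A prompt as its own Chat API request."""
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm([HumanMessage(content=prompt)]).content

def ask_all(questions: list[str], context: str) -> list[str]:
    """Issue all questions concurrently from one process (threads only)."""
    with ThreadPoolExecutor(max_workers=len(questions)) as pool:
        return list(pool.map(lambda q: ask(q, context), questions))
```

Since the workers are plain threads waiting on network I/O, they all slow down together if the container itself is starved of CPU.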

Any suggestions to improve the API performance within the Cloud Run service would be truly appreciated!

Note: We are not using an API proxy.


@Foxalabs has some great suggestions

The delays we observed were due to hardware configuration issues with GCP. Once we upgraded our service so that a dedicated CPU is always allocated, the API issues went away.
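For anyone hitting the same thing: by default Cloud Run only allocates CPU to a container while it is actively processing a request, so slow outbound calls can get throttled. The fix above corresponds to the “CPU is always allocated” setting; a sketch of the command (the service name is a placeholder) looks like this:

```bash
gcloud run services update my-qa-service --no-cpu-throttling
```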


Thank you so much! This saved my day. I had a similar issue because I had set up request-based pricing and my service was running on websockets; I guess Cloud Run only allocates resources while it is handling HTTP requests.