We are using OpenAI’s chat API (via the langchain Python module) to implement a document Q&A service. While testing the service locally on my development machine (MacBook Pro), API response times are low (roughly 2–5 seconds). However, once the service is deployed to GCP Cloud Run, we see a significant degradation in response time (roughly 30–120 seconds).
Here are some details regarding the implementation:
- Using the class ChatOpenAI from the langchain module (a rough sketch of the setup is included after this list)
- Tried both the ‘gpt-3.5-turbo-0613’ and ‘gpt-3.5-turbo-16k-0613’ models; haven’t observed much difference in performance between them
- We need to ask multiple questions to produce the expected result, and each question is handled as a separate Chat API request. When we parallelize these requests, the individual response times get worse (see the threading sketch after this list).
- Using Pinecone for storing the vectorized document data.
- The Cloud Run service runs on a single-vCPU instance, since the code only uses multi-threading (no multiprocessing support has been added to the service)
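For context, here is a rough sketch of what the setup looks like. This is not our exact code; the Pinecone index name, keys, and sample question are placeholders:

```python
# Simplified sketch of the document Q&A setup (placeholder keys/index name).
import pinecone
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA

pinecone.init(api_key="...", environment="...")  # placeholders

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo-0613",
    temperature=0,
    request_timeout=60,   # bounded timeout so slow calls fail fast
    max_retries=2,
)

vectorstore = Pinecone.from_existing_index(
    index_name="docs-index",          # placeholder index name
    embedding=OpenAIEmbeddings(),
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)

# Each question is sent as its own Chat API request through the chain.
answer = qa_chain.run("What is the document about?")
```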
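And roughly how the questions are fanned out today, using a plain ThreadPoolExecutor on the single-vCPU instance (the question list and worker count below are only illustrative):

```python
# Illustrative fan-out of the per-question requests via a thread pool.
from concurrent.futures import ThreadPoolExecutor

questions = [
    "What is the document about?",
    "Who are the parties involved?",
    "What are the key dates?",
]

def ask(question: str) -> str:
    # One independent Chat API request per question.
    return qa_chain.run(question)

with ThreadPoolExecutor(max_workers=3) as pool:
    answers = list(pool.map(ask, questions))
```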
Any suggestions for improving the API performance within the Cloud Run service would be much appreciated!
Note: we are not using an API proxy.