I’ve built an app in Node.js using LangChain and gpt-3.5-turbo, and I’ve created an in-memory vector store from a set of crawled pages using LangChain’s recursive URL loader. After some time spent processing the page text, the model fires up and I’m able to submit queries.
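Roughly, the setup looks like this (a simplified sketch; my actual chunk sizes, seed URL, and prompt differ, and import paths may vary by LangChain version):

```js
import { RecursiveUrlLoader } from "langchain/document_loaders/web/recursive_url";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { ChatOpenAI } from "langchain/chat_models/openai";
import { RetrievalQAChain } from "langchain/chains";

// Crawl pages starting from a seed URL (URL and depth are placeholders)
const loader = new RecursiveUrlLoader("https://example.com/docs", { maxDepth: 2 });
const rawDocs = await loader.load();

// Split the crawled text into chunks before embedding
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 100 });
const docs = await splitter.splitDocuments(rawDocs);

// Embed everything into an in-memory vector store (this is the slow startup step)
const vectorStore = await MemoryVectorStore.fromDocuments(docs, new OpenAIEmbeddings());

// Retrieval QA chain over the store, answering with gpt-3.5-turbo
const model = new ChatOpenAI({ modelName: "gpt-3.5-turbo", temperature: 0 });
const chain = RetrievalQAChain.fromLLM(model, vectorStore.asRetriever());
```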
The problem I’m having is that responses take a long time to generate: some come back in about 3 seconds, but some take as long as 11 seconds, and I don’t know how to fix it. I have a prompt that defines how the model should respond, and there is admittedly a lot of data in the vector store, on the order of ~33,000 paragraphs of text. But I imagine the production version of GPT-3.5 works over far more data and still responds within a couple of seconds most of the time. I’d love to get down to 2-3 seconds on average, rather than the ~8 seconds I currently see.
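To narrow down where the time goes, I can time the similarity search separately from the full chain call, something like this (using the variables from the sketch above; the query string is just an example):

```js
// Compare retrieval time alone against the full retrieve-then-generate call,
// to see which step dominates the ~8 s average.
const query = "example question about the crawled docs"; // placeholder

const t0 = Date.now();
await vectorStore.similaritySearch(query, 4); // k = 4 nearest chunks
console.log(`retrieval: ${Date.now() - t0} ms`);

const t1 = Date.now();
const res = await chain.call({ query });
console.log(`retrieval + generation: ${Date.now() - t1} ms`);
console.log(res.text);
```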
Is there anything I could be missing? Could I improve the prompt? Is gpt-3.5-turbo the wrong model for this use case?