Performance question: Pinecone, LangChain, ChatGPT, Gradio, Python

I have written a pretty basic chat app using Python (3.11.5), Pinecone, LangChain, ChatGPT, and Gradio. For what it is worth, the data in Pinecone totals 4704 tokens across about 80 records. My average input/output is 33 tokens.

The code below should not be graded for syntax; I modified it to pull it all together for viewing purposes. In the end, the code works and returns great information, but the processing time is terrible. I would be happy with 3-5 seconds, but on average it takes 6-20 seconds to get a response back. Pinecone usually accounts for less than 1.1 seconds of this, and the rest falls on the RetrievalQA side.
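For anyone who wants to reproduce this kind of per-stage breakdown, here is a minimal sketch of how the timings could be collected with the standard library. The `timed` helper and the two `fake_*` functions are hypothetical stand-ins, not part of my actual code; in the real app the `with` blocks would wrap the `vectorstore.similarity_search(...)` call and the RetrievalQA call.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed wall-clock seconds


@contextmanager
def timed(label):
    """Record how long the enclosed block takes under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Hypothetical stand-ins for the real calls, so the sketch runs on its own.
def fake_similarity_search(query):
    time.sleep(0.01)  # pretend to query the vector store
    return ["doc"]


def fake_qa(query):
    time.sleep(0.02)  # pretend to call the LLM chain
    return "answer"


with timed("vectorstore"):
    docs = fake_similarity_search("what are your hours?")

with timed("qa"):
    answer = fake_qa("what are your hours?")

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

Using `time.perf_counter()` rather than `time.time()` is a deliberate choice here, since it is a monotonic, high-resolution clock intended for measuring short durations.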

I have done performance testing, and the slowness really comes down to RetrievalQA. I have fine-tuned the $*&^ out of the code, adding parallelization so that some other non-critical things I want to do can run within the larger time frame required by the vectorstore and RetrievalQA.
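To illustrate the parallelization idea, here is a rough sketch (not my actual code) of running a non-critical task on a worker thread while the slow call is in flight. `slow_answer` and `log_question` are hypothetical placeholders for the RetrievalQA call and for whatever side work needs doing.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_answer(question):
    """Hypothetical stand-in for the slow RetrievalQA call."""
    time.sleep(0.05)
    return f"answer to {question}"


def log_question(question):
    """Hypothetical non-critical side task, e.g. logging or analytics."""
    time.sleep(0.01)
    return len(question)


def handle(question):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Kick off the side task first so it overlaps with the slow call.
        side = pool.submit(log_question, question)
        answer = slow_answer(question)
        # The side task has almost certainly finished by now.
        return answer, side.result()


answer, logged_len = handle("hello")
print(answer, logged_len)
```

This only hides the cost of the side work inside the time the slow call already takes; it does nothing to shorten the slow call itself.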

Here are some questions and other stats (way too much information, but I thought maybe something would spark a thought):

  • Pinecone: I am using the free version. I have loaded the data and that is it; no fine-tuning there. My data is as it was loaded. Would it benefit me at all to pay the 70/month for the upgraded version?
  • I am assuming that this is a tiny set of data and that I would not see any performance gain from indexing it? I am not even sure that is an option, but I am throwing it out there.
  • I am using the ChatGPT API and can easily switch between gpt-3.5-turbo and gpt-3.5-turbo-0301; for the most part I do not see any significant performance or response differences. I do not want to pay for GPT-4; as I said above, my results are great, but the execution is slow.
  • I am running on a workstation at home. Would I see a huge boost in performance by moving to an Azure / Amazon environment? PC specs are below.
  • I am running on an Intel(R) Core™ i7-8700 CPU @ 3.20GHz, 3192 MHz, 6 Core(s), 12 Logical Processor(s) with 32 GB of RAM, on the latest build of Windows 11. I am not sure the video card is in play, but it is Intel(R) UHD Graphics 630. This was meant to be a home server, not a gaming machine :)
  • The hard drive it is running off of is a 1 TB 7200 rpm HDD (3.5 in). It just occurred to me that maybe I should move it to the 256 GB PCIe NVMe M.2 SSD. That said, the long processing time is not on my end.
  • My internet connection is 1 Gb, so that is not an issue.

My last question is about the code and whether anyone has any performance suggestions. As I said above, the code below should not be graded for syntax; I modified it to pull it all together for viewing purposes. RetrievalQA takes 3.5 to 16 seconds, usually toward the longer end. The vectorstore call is usually just under 1 second.

Let me know if there are any areas of the code I missed that I should have included.

Usual process time: 3.5 - 16 seconds
qa = RetrievalQA.from_chain_type(

	# model_name="gpt-3.5-turbo"


chain_type_kwargs={"prompt": PROMPT, "verbose": False}


Usual process time: 0.8 - 1.1 seconds
vectorstore.similarity_search(input_text, k=4)

Miscellaneous items that could have an impact, added as an FYI:

  • I have gone from k=1 to k=4 and see no real difference.
  • My prompt may be too long, but it is tight with respect to not answering off-topic questions.
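As a back-of-the-envelope check on the k setting, the figures above (4704 tokens across about 80 records) give a rough idea of how much retrieved context each value of k adds to the prompt. This sketch assumes the records are roughly equal in size, which may not hold for the real data.

```python
total_tokens = 4704  # total tokens stored in Pinecone (from the stats above)
records = 80         # approximate number of records

# Assumption: records are roughly the same size.
avg_tokens_per_record = total_tokens / records

# Estimated retrieved-context size for each k that was tried.
context_tokens = {k: round(k * avg_tokens_per_record) for k in (1, 4)}

print(f"average tokens per record: {avg_tokens_per_record:.1f}")
for k, tokens in context_tokens.items():
    print(f"k={k}: about {tokens} context tokens")
```

Under that assumption, going from k=1 to k=4 adds fewer than 200 tokens of context, which fits with seeing no real difference between the two settings.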

prompt_template = f"""You are here to assist customers with their questions. Please provide helpful responses based on the context and questions you receive. If you need more information, feel free to ask for clarification. If you are unsure about a question or cannot provide a meaningful response, you can say “Please rephrase the question or reach out directly for more information” or request more information.


Question: {{question}}
Answer: Please feel free to expand on answers with relevant context to help me provide a better response."""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
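One detail about the template that is easy to trip over: because it is an f-string, the doubled braces `{{question}}` come out as the single-brace placeholder `{question}` that gets filled in later. The sketch below shows this with plain `str.format` standing in for PromptTemplate; the shortened template text, the `company` variable, and the sample values are all hypothetical.

```python
company = "Acme"  # hypothetical value interpolated immediately by the f-string

# Single braces are filled in now; doubled braces survive as placeholders.
template = f"""You are here to assist {company} customers with their questions.

{{context}}

Question: {{question}}
Answer:"""

print(template)

# Later, the remaining placeholders are filled in per request.
filled = template.format(
    context="Store hours: 9-5.",
    question="When are you open?",
)
print(filled)
```

If nothing needs to be interpolated up front, dropping the `f` prefix and writing `{context}` and `{question}` with single braces gives the same template.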

Thanks for any input.