Langchain - Improve prompt latency with map_reduce

Hi everyone,

Hope you are well.

I am looking for some advice on a small use case I have. I am building a small application that combines RAG with GPT-4 prompts.

With the RAG part, I retrieve documents that are usually 2 to 10 pages long, then pass them as context in my prompt to answer a specific question.

Given the size of the context, I am using chain_type="map_reduce" to process the whole context progressively and execute my prompt with GPT-4 (I am using LangChain).

def gpt4_query_mapreduce(prompt, input_documents):
    from langchain.chat_models import ChatOpenAI
    from langchain.chains.question_answering import load_qa_chain
    import time
    time.sleep(3)  # Pause to avoid overloading the OpenAI API

    # Initialise the LLM access (API key redacted)
    llm = ChatOpenAI(openai_api_key="xxxxx", model="gpt-4")
    # Initialise the QA chain. map_reduce answers the question over each
    # document separately, then combines the intermediate answers.
    chain = load_qa_chain(llm, chain_type="map_reduce", verbose=True)
    # Submit the question and the retrieved documents to GPT-4
    answer = chain({"input_documents": input_documents, "question": prompt},
                   return_only_outputs=True)
    return answer
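On the parallelism question: LangChain's map_reduce document chain can run its per-document map calls concurrently if you go through the async API (e.g. `await chain.acall(...)` instead of the synchronous call) rather than issuing them one after another. Independent of LangChain, the pattern is just asyncio fan-out/fan-in; in this illustrative sketch, `map_call` is a stand-in for one per-chunk LLM request, with a sleep simulating API latency:

```python
import asyncio
import time

async def map_call(chunk: str) -> str:
    # Stand-in for one per-chunk LLM call; the sleep simulates API latency.
    await asyncio.sleep(0.1)
    return f"summary of {chunk}"

async def map_reduce(chunks: list[str]) -> str:
    # Map phase: all per-chunk calls run concurrently instead of sequentially.
    summaries = await asyncio.gather(*(map_call(c) for c in chunks))
    # Reduce phase: one final combination over the intermediate answers.
    return " | ".join(summaries)

chunks = [f"chunk{i}" for i in range(10)]
start = time.perf_counter()
result = asyncio.run(map_reduce(chunks))
elapsed = time.perf_counter() - start
# Ten concurrent 0.1 s calls finish together, far faster than ten in a row.
print(f"{elapsed:.2f}s")
```

With ten documents, the sequential version would take roughly ten times the per-call latency; the concurrent version takes roughly one. The same shape applies when the map calls are real GPT-4 requests, subject to your rate limits.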

Here is the deal: processing is extremely slow, sometimes taking more than 15 minutes to execute a single prompt. I understand that a large context adds latency and that GPT-4 is not the fastest model, but this execution time still feels excessive.

Therefore, I was wondering: is there any method to speed up the process? Is there a way to run map_reduce in parallel? Are my RAG documents simply too large? Anything else?
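On the document size point: feeding whole 2-10 page documents through the map step means every map call carries a large prompt. Splitting each document into smaller overlapping chunks (LangChain has splitters for this) and passing only the most relevant chunks usually cuts both latency and cost. A minimal, library-free sketch of the chunking idea (`split_text` and its parameters are illustrative, not a LangChain API):

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    # Slide a window over the text so each chunk shares `overlap`
    # characters with the previous one, preserving context at the seams.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 2500  # a ~1-page document for illustration
chunks = split_text(doc)
print(len(chunks))  # 3 chunks of at most 1000 characters each
```

Each map call then sees at most `chunk_size` characters instead of a full document, so individual GPT-4 requests return much faster, at the price of more (smaller) calls in the map phase.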

Thank you in advance for your help