Token use on Langchain PDF reader

daniel.bedford · September 9, 2023, 9:12pm

The code below lets me ask questions about a 33-page PDF document. However, it seems to have used up my tokens very quickly, and the responses were not very accurate. Any advice on how to improve this (for example, by changing my chunking strategy)? Or is there an alternative to Langchain that would produce better and more cost-effective results?

from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter

os.environ['OPENAI_API_KEY'] = '___'

# Load the PDF (PyPDFLoader returns one document per page)
loader = PyPDFLoader("data/33.pdf")
data = loader.load()

# Split the pages into chunks of up to 500 characters, with no overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

# Embed every chunk and build the vector store
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

question = "What were the facts of the decision in the case of McDonald v Chelsea?"
# Sanity check: fetch the chunks most similar to the question
docs = vectorstore.similarity_search(question)
len(docs)

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Answer the question with gpt-3.5-turbo, using the vector store as the retriever
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectorstore.as_retriever())
result = qa_chain({"query": question})

print(result)

GTT4 · September 11, 2023, 12:54pm

I am not an expert on this, but I recently used the same code from a tutorial. I think your high token usage results from embedding the whole document anew every time you run the code. You can check that in the “Daily Usage Breakdown”. If your next question is how to save the embeddings to disk and retrieve them from there instead, I am currently trying to learn that as well.
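One way to avoid paying for the same embeddings on every run is to cache them on disk, keyed by a hash of each chunk's text. The following is only a minimal, standard-library sketch of that idea, not how LangChain or Chroma do it internally. `EmbeddingCache`, `fake_embed`, and the `embeddings.json` file name are hypothetical names made up for illustration, and `fake_embed` stands in for a real (paid) embedding call.

```python
import hashlib
import json
import os
import tempfile
from typing import Callable, Dict, List


class EmbeddingCache:
    """Disk-backed cache mapping a hash of the chunk text to its embedding."""

    def __init__(self, path: str, embed_fn: Callable[[str], List[float]]):
        self.path = path
        self.embed_fn = embed_fn  # stand-in for a real (paid) embedding call
        self.calls = 0            # how many times embed_fn was actually invoked
        self.store: Dict[str, List[float]] = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.store = json.load(f)

    def embed(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            # Cache miss: this is the only place where tokens would be spent.
            self.store[key] = self.embed_fn(text)
            self.calls += 1
        return self.store[key]

    def save(self) -> None:
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.store, f)


def fake_embed(text: str) -> List[float]:
    # Hypothetical placeholder; a real setup would call an embedding API here.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
chunks = ["first chunk of the PDF", "second chunk of the PDF"]

# First run: every chunk is new, so embed_fn is called once per chunk.
first_run = EmbeddingCache(cache_path, fake_embed)
vectors = [first_run.embed(c) for c in chunks]
first_run.save()

# Second run (e.g. after restarting the script): vectors are read from disk.
second_run = EmbeddingCache(cache_path, fake_embed)
vectors_again = [second_run.embed(c) for c in chunks]

print(first_run.calls, second_run.calls)  # prints: 2 0
```

For the code in this thread, LangChain's Chroma wrapper also accepts a `persist_directory` argument for keeping the index on disk, which may be the more direct route; check the documentation for the version you have installed.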
I’d figure improving accuracy is what everyone is currently working on, so there might not be an “easy” solution yet. Just experiment with prompting, chunking, cleaning your PDF better, etc. Also consider using GPT-4, shorter (more precise) questions, or more “chains” in between that split your long (complex) question into shorter ones. That helps in my experience.
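As one concrete thing to try on the chunking side: the original code uses `chunk_overlap = 0`, so a sentence that straddles a chunk boundary is cut in two and neither half may match the question well. The sketch below illustrates what overlap does using a plain fixed-size character window. It is a simplified stand-in, not the separator-aware algorithm of `RecursiveCharacterTextSplitter`, and `split_with_overlap` is a hypothetical helper name.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghij"

# Without overlap, each character appears in exactly one chunk.
no_overlap = split_with_overlap(text, chunk_size=4, overlap=0)
print(no_overlap)    # prints: ['abcd', 'efgh', 'ij']

# With overlap, text near a boundary appears in both neighbouring chunks.
with_overlap = split_with_overlap(text, chunk_size=4, overlap=2)
print(with_overlap)  # prints: ['abcd', 'cdef', 'efgh', 'ghij']
```

Overlap produces more chunks, so the one-time embedding cost goes up somewhat. Whether it actually improves answers for a given PDF is something to test rather than assume.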
