What's the appropriate way to convert PDFs to text files?

I want to take my personal library of scientific articles (PDF form) and use them for fine-tuning. I'd also like to convert them to vectors to use in neural search.

I'm new to coding. I've been messing with the different Python libraries to try and extract text from the PDFs. I can extract the text into a big text file, but it's kind of a mess: there are superscript numbers for references, figure text, the figures themselves, as well as different sections such as the abstract and bibliography. How do I extract the data so that it still has some formatting and doesn't have a bunch of junk in it that will create noise for training and search?


I'm using this code to extract text from PDFs to generate summaries:

import urllib.request
from io import BytesIO
import PyPDF2

# Download the PDF into an in-memory buffer, then extract every page's text
response = urllib.request.urlopen(url)
pdf_file = BytesIO(response.read())
pdfReader = PyPDF2.PdfFileReader(pdf_file)  # PyPDF2 1.x/2.x API
text = ""
for page in range(pdfReader.numPages):
    text += pdfReader.getPage(page).extractText()

You could generate an embedding from that like so:

import openai

openai.api_key = "your api key"
# Trim the text to roughly 6000 tokens before embedding
response = openai.Embedding.create(input=fit_within_token_limit(text, 6000),
                                   model="text-embedding-ada-002")
vEmbedding = response['data'][0]["embedding"]

And here are the two helper functions for counting tokens and fitting text within a token limit. Note that I have to use TextBlob because I'm stuck on Python 3.7 and tiktoken requires 3.8…grrr

from textblob import TextBlob

def count_tokens(vCountTokenStr):
    # Tokenize the input string (word-level; a rough stand-in for tiktoken)
    blob = TextBlob(vCountTokenStr)
    tokens = blob.words

    # Count the number of tokens
    num_tokens = len(tokens)
    return num_tokens

def fit_within_token_limit(text, token_limit):
    remaining_tokens = token_limit
    shortened_text = text

    while count_tokens(shortened_text) > token_limit:
        # Reduce the length of the text by 10% and try again
        shortened_length = int(len(shortened_text) * 0.9)
        shortened_text = shortened_text[:shortened_length]

    return shortened_text

Hope that helps out a bit!
Dale


Convert it to TIFF, then use Tesseract OCR (e.g. pytesseract) and convert to hOCR.
You don't want to lose the positioning of elements.
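
For reference, a minimal sketch of that pipeline, assuming pdf2image (which needs poppler installed) and pytesseract are available and "paper.pdf" is just a placeholder filename:

from pdf2image import convert_from_path
import pytesseract

# Render each PDF page to an image, then OCR it to hOCR so the positions
# of elements are kept alongside the recognized text.
pages = convert_from_path("paper.pdf", dpi=300)  # one PIL image per page
for i, page_image in enumerate(pages):
    hocr_bytes = pytesseract.image_to_pdf_or_hocr(page_image, extension="hocr")
    with open(f"paper_page_{i}.hocr", "wb") as f:
        f.write(hocr_bytes)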


Thanks Dale - does the noise (e.g. references) from the PDF impact your ability to apply the embeddings?

Honestly, embeddings seem pretty robust; I don't think the references will throw it off too much. I guess you'll need to do a couple of tests to be sure. Search for terms that have a reference in the middle and double-check they still return a good value when you run your cosine check. My cosine_similarity function being:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product over the product of the vector norms
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    similarity = dot_product / (norm_a * norm_b)
    return similarity
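
For a quick test of the reference-noise question, you could embed a query and compare it against the document vector from earlier. A rough sketch, assuming openai.api_key is already set, vEmbedding is the document embedding from the snippet above, and the query string is just an example:

import numpy as np
import openai

query = "example search phrase that sits near a citation in the source text"
q_response = openai.Embedding.create(input=query, model="text-embedding-ada-002")
q_embedding = q_response['data'][0]['embedding']

# Higher means more similar
print(cosine_similarity(np.array(q_embedding), np.array(vEmbedding)))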

Another forum user mentioned there is a built-in openai function for that too in the Python module, but this one seems to work nicely. (Thanks ChatGPT.)

Another tactic might be to look at the extracted text and then ask ChatGPT to write you a nice regular expression to remove the references, however they're formatted in your text. That might work if they're consistent; something like the sketch below.
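
Rough example of that kind of cleanup, assuming the references come out as bracketed numbers like [12] or [3, 4]; the pattern would need adjusting for superscript-style markers or other formats:

import re

# Strip bracketed citation markers such as [12] or [3, 4] from the extracted text
citation_pattern = re.compile(r"\[\d+(?:\s*,\s*\d+)*\]")
cleaned_text = citation_pattern.sub("", text)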