What's the appropriate way to convert PDFs to text files?

I want to take my personal library of scientific articles (PDF form) and use them for fine-tuning. I'd also like to convert them to vectors to use in neural search.

I'm new to coding. I've been messing with the different Python libraries to try and extract text from the PDFs. I can extract the text into a big text file, but it's kind of a mess: there are superscript numbers for references, figure text, the figures themselves, as well as different sections such as the abstract and bibliography. How do I extract the data so that it still has some formatting and doesn't have a bunch of junk in it that will create noise for training and search?


I'm using this code to extract text from PDFs to generate summaries:

import urllib.request
from io import BytesIO
import PyPDF2

# Download the PDF into an in-memory buffer, then extract every page's text
response = urllib.request.urlopen(url)
pdf_file = BytesIO(response.read())
pdfReader = PyPDF2.PdfFileReader(pdf_file)  # PyPDF2 1.x/2.x API
text = ""
for page in range(pdfReader.numPages):
    text += pdfReader.getPage(page).extractText()

You could generate an embedding from that like so:

import openai

openai.api_key = "your api key"
# Trim the text to roughly 6000 tokens before embedding
response = openai.Embedding.create(input=fit_within_token_limit(text, 6000),
                                   model="text-embedding-ada-002")
vEmbedding = response['data'][0]["embedding"]

And here are the two helper functions for counting tokens and fitting text within a token limit. Note that I have to use TextBlob because I'm stuck on Python 3.7 and tiktoken requires 3.8…grrr

from textblob import TextBlob

def count_tokens(vCountTokenStr):
    # Tokenize the input string (word-level; a rough stand-in for tiktoken)
    blob = TextBlob(vCountTokenStr)
    tokens = blob.words

    # Count the number of tokens
    num_tokens = len(tokens)
    return num_tokens

def fit_within_token_limit(text, token_limit):
    remaining_tokens = token_limit
    shortened_text = text

    while count_tokens(shortened_text) > token_limit:
        # Reduce the length of the text by 10% and try again
        shortened_length = int(len(shortened_text) * 0.9)
        shortened_text = shortened_text[:shortened_length]

    return shortened_text

Hope that helps out a bit!
Dale


Convert it to TIFF, then use Tesseract OCR (e.g. pytesseract) and convert to hOCR.
You don't want to lose the positioning of elements.
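
For reference, a minimal sketch of that pipeline, assuming pdf2image (which needs poppler installed) and pytesseract are available and "paper.pdf" is just a placeholder filename:

from pdf2image import convert_from_path
import pytesseract

# Render each PDF page to an image, then OCR it to hOCR so the positions
# of elements are kept alongside the recognized text.
pages = convert_from_path("paper.pdf", dpi=300)  # one PIL image per page
for i, page_image in enumerate(pages):
    hocr_bytes = pytesseract.image_to_pdf_or_hocr(page_image, extension="hocr")
    with open(f"paper_page_{i}.hocr", "wb") as f:
        f.write(hocr_bytes)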


Thanks Dale - does the noise (e.g. references) from the PDF impact your ability to apply the embeddings?

Honestly, embeddings seem pretty robust; I don't think the references will throw it off too much. I guess you'll need to do a couple of tests to be sure. Search for terms that have a reference in the middle and double-check they still return a good value when you run your cosine check. My cosine_similarity function being:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product over the product of the vector norms
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    similarity = dot_product / (norm_a * norm_b)
    return similarity
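
For a quick test of the reference-noise question, you could embed a query and compare it against the document vector from earlier. A rough sketch, assuming openai.api_key is already set, vEmbedding is the document embedding from the snippet above, and the query string is just an example:

import numpy as np
import openai

query = "example search phrase that sits near a citation in the source text"
q_response = openai.Embedding.create(input=query, model="text-embedding-ada-002")
q_embedding = q_response['data'][0]['embedding']

# Higher means more similar
print(cosine_similarity(np.array(q_embedding), np.array(vEmbedding)))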

Another forum user mentioned there is a built-in openai function for that too in the Python module, but this one seems to work nicely. (Thanks ChatGPT.)

Another tactic might be to look at the extracted text and then ask ChatGPT to write you a nice regular expression to remove the references, however they're formatted in your text. That might work if they're consistent; something like the sketch below.
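
Rough example of that kind of cleanup, assuming the references come out as bracketed numbers like [12] or [3, 4]; the pattern would need adjusting for superscript-style markers or other formats:

import re

# Strip bracketed citation markers such as [12] or [3, 4] from the extracted text
citation_pattern = re.compile(r"\[\d+(?:\s*,\s*\d+)*\]")
cleaned_text = citation_pattern.sub("", text)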