How to extract technical expressions from PDFs so that they can be understood by AI?

I agree with Eric’s assessment, although I do have some reservations about the TeX source. The example used defines a single \newcommand, but the code itself contains several different custom commands, such as \pat and tbool.

These commands are most likely defined elsewhere in the document source, so this issue might be solved with an extended context window.

But we’re out of luck if these commands are defined in non-standard LaTeX packages.

Yes, this is a valid point that must be addressed before passing a subset of the TeX or LaTeX to ChatGPT.

Follow up

In order to work with a larger body of TeX switching to

The book

  • is open source on GitHub
  • is free
  • needs little to no prior knowledge to understand
  • has various representations of technical expressions in TeX that can be viewed in human-readable form in the PDF format.

For this truth table

[image: truth table]

the TeX is

https://github.com/rzach/forallx-yyc/blob/dd26ca3043c9288fc548836e136f2416ed70d93d/forallx-yyc-truthtables.tex#L21-L26

Giving that to ChatGPT as a new conversation

Prompt

\begin{tabular}{c|c}
\metav{A} & \enot\metav{A}\\
\hline
T & F\\
F & T 
\end{tabular}

Completion

The table you provided represents the truth values of a formula and its negation. Let’s break it down:

  • \metav{A}: This is a placeholder representing a statement or proposition. It can take on the truth values True (T) or False (F).

  • \enot\metav{A}: This is the negation of \metav{A}. It represents the opposite truth value of \metav{A}. If \metav{A} is True, then \enot\metav{A} is False, and vice versa.

The table shows the possible truth values for \metav{A} and \enot\metav{A}. When \metav{A} is True, \enot\metav{A} is False. When \metav{A} is False, \enot\metav{A} is True. This is consistent with the logical operation of negation.

So, the table reflects the basic principles of negation in classical logic.

Personal note:
Obviously ChatGPT can understand the TeX but some of the TeX needs to be pretty printed.


Here is a refined prompt in a new conversation that addresses the evident problems in the first completion.

Prompt

LaTeX commands and meaning
\enot - `not` symbol
\metav - meta variable

```LaTeX
\begin{tabular}{c|c}
\metav{A} & \enot\metav{A}\\
\hline
T & F\\
F & T 
\end{tabular}
```

* format for readability by a human
* convert LaTeX  commands to symbols
* do not explain LaTeX 
* do not explain Markdown or other formatting
* explain expression for a beginner

Completion

A | ¬A
------
T | F
F | T

Explanation for a beginner:
This table represents the truth values of a statement A and its negation (¬A). The variable A can take on two truth values: True (T) or False (F). The table shows all possible combinations of truth values for A and its negation ¬A. When A is True, ¬A is False, and vice versa.


Since TeX is actually code with macros, variables and state, such information needs to be passed to ChatGPT to correctly process and/or pretty print the TeX.

While smaller examples of TeX can be passed in a single prompt, there will be TeX that is larger than the token limit and so cannot be passed in. Possible solutions would be to use existing code that parses TeX, and maybe even runs part of the TeX to establish the state at that point, or to train an LLM so that it can infer reasonable values without seeing all of the TeX.
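As a rough illustration of the "use existing code that parses TeX" idea, here is a minimal Python sketch that pulls only the `\newcommand` definitions a snippet actually uses out of a larger source and puts them in front of the snippet. The function names, the prompt wording, and the example definitions of `\enot` and `\metav` are all assumptions made for illustration; they are not the book's real macros. It does no real TeX parsing: it ignores `\def`, packages, and escaped braces.

```python
import re

# Start of a definition: \newcommand{\name}, \newcommand\name, \renewcommand*{\name}, ...
_DEF_RE = re.compile(r"\\(?:re)?newcommand\*?\s*\{?\s*\\([A-Za-z]+)\s*\}?")
# Any control word such as \metav or \hline.
_USE_RE = re.compile(r"\\([A-Za-z]+)")


def find_macro_definitions(source: str) -> dict[str, str]:
    """Map each macro name to the full text of its \\newcommand definition.

    The body is found by counting braces, so nested groups work, but
    escaped braces such as \\{ are not handled (known limitation).
    """
    definitions = {}
    for match in _DEF_RE.finditer(source):
        body_start = source.find("{", match.end())
        if body_start == -1:
            continue
        depth = 0
        for i in range(body_start, len(source)):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    definitions[match.group(1)] = source[match.start():i + 1]
                    break
    return definitions


def used_macros(snippet: str) -> set[str]:
    """Names of all control words that appear in the snippet."""
    return set(_USE_RE.findall(snippet))


def build_prompt(snippet: str, source: str) -> str:
    """Prefix the snippet with only the definitions it needs."""
    definitions = find_macro_definitions(source)
    needed = sorted(used_macros(snippet) & set(definitions))
    lines = ["LaTeX macro definitions used below:"]
    lines += [definitions[name] for name in needed]
    lines += ["", "Expression:", snippet]
    return "\n".join(lines)


# Hypothetical preamble: these definitions are invented for the example.
EXAMPLE_PREAMBLE = r"""
\newcommand{\enot}{\neg}
\newcommand{\metav}[1]{\mathcal{#1}}
\newcommand{\unused}{x}
"""

EXAMPLE_SNIPPET = r"""\begin{tabular}{c|c}
\metav{A} & \enot\metav{A}\\
\hline
T & F\\
F & T
\end{tabular}"""

if __name__ == "__main__":
    print(build_prompt(EXAMPLE_SNIPPET, EXAMPLE_PREAMBLE))
```

Sending only the definitions that are used keeps the prompt small, which is the point when the full source would exceed the token limit. Macros defined inside packages would still be missing and would need to be described by hand.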

@elm have you tried the InftyReader that @N2U posted above? What can you tell us about it?

I believe it was @Foxalabs who brought InftyReader to our attention. Any thoughts on this?

As mentioned by Eric, optical character recognition of complex equations seems to be an issue many have struggled with.

With regard to PDFs that are just images of the document, they are only good for passing to OCR.

As for trying to extract technical expressions from PDFs that use PostScript, that too is not a good solution. PostScript is mostly about giving the positions at which to place graphics, so all of the metadata from a source such as LaTeX is lost, which makes the task much harder. I will note that I did try InftyProject, and for a PDF that uses PostScript it is the best I have seen of the few things mentioned.

Years ago I was dissuaded by the fact that everywhere one looked at the problem, everything noted and all the apps for sale were based on PDFs that used PostScript. After working with that for a while, I realized that while it is possible, there has to be something with much higher accuracy. Now that many PDFs can be downloaded with the original LaTeX, as noted in the example earlier, this is much more productive: not only is the technical expression there in a linear format, but the metadata helps the LLMs expand on the result, and it is much quicker. There are still some dots that need connecting, but nothing that appears to be a show stopper.

It’s been a bit quiet here for the past few weeks, but I thought I would share an update on my progress.

Eric’s statement here is indeed correct:

So our first problem is finding a proper dataset to work with. I did some digging and found this collection of 100k images of formulas with their associated LaTeX source.

Next we’ll need a functional OCR system, and I found this paper describing such a thing:

https://dl.acm.org/doi/10.1145/3395027.3419580

This research introduces PDF2LaTeX, an OCR system designed to extract mathematical content and text from PDFs and convert them into LaTeX markup. The authors use a Conditional Random Field (CRF) to distinguish between regular text and mathematical expressions. They then use two separate models, based on a combination of CNN and LSTM (Long Short-Term Memory) neural network architectures, to translate image blocks of both math expressions and plain text into LaTeX.

While their OCR system outperformed InftyReader, it still has some issues with math, as shown in the table below.

Thanks! :tada:

Going after PDFs as images was not high on my list; PDFs based on PostScript were much higher (think low-hanging fruit). Still, it is a case that needs to be covered.

I took the first item from the list im2latex_formulas.lst to see how well the LaTeX could be understood by ChatGPT (Model: GPT-3.5) and used to help a person.


ChatGPT August 3 Version
Model: GPT-3.5
Prompt

For the following LaTeX for a math expression

  • display as a math expression
  • explain to a university student
  • if it is has a specific use include internet references such as wikipedia
\int_{-\epsilon}^\infty dl\: {\rm e}^{-l\zeta}	\int_{-\epsilon}^\infty dl' {\rm e}^{-l'\zeta}	ll'{l'-l \over l+l'} \{3\,\delta''(l) - {3 \over 4}t\,\delta(l) \} =0.		\label{eq21}

Completion

Display as a math expression:

The given LaTeX expression is:

\[
\int_{-\epsilon}^\infty dl \: e^{-l\zeta} \int_{-\epsilon}^\infty dl' \: e^{-l'\zeta} \frac{ll'(l'-l)}{l+l'} \left\{3\delta''(l) - \frac{3}{4}t\delta(l)\right\} = 0 \quad \text{Eq. (21)}
\]

[image: rendering of the expression]

Explain to a university student:

In this math expression, we have an equation represented in LaTeX notation. Let’s break down the components step by step:

  1. The expression starts with two integrals:

    \(\int_{-\epsilon}^\infty dl\) represents the integration over the variable \(l\) from \(-\epsilon\) to positive infinity.

    \(\int_{-\epsilon}^\infty dl'\) represents the integration over the variable \(l'\) from \(-\epsilon\) to positive infinity.

  2. Inside the integrals, we have \(e^{-l\zeta}\) and \(e^{-l'\zeta}\), which are exponential functions with \(l\) and \(l'\) as variables, respectively. The Greek letter \(\zeta\) seems to be a constant.

  3. Next, we have \(ll'(l'-l)\), which is a product of three terms involving the variables \(l\) and \(l'\).

  4. After that, we have a fraction \(\frac{{l'l(l'-l)}}{{l+l'}}\), where \(l+l'\) appears in the denominator.

  5. Then, we have the curly braces \(\{ \}\), which contain two terms: \(3\delta''(l)\) and \(\frac{3}{4}t\delta(l)\). These are delta functions \(\delta(l)\) and \(\delta''(l)\) with coefficients \(3\) and \(\frac{3}{4}t\), respectively. The variable \(t\) seems to be some constant.

  6. Finally, the whole expression is equated to zero, indicating that this entire mathematical expression equals zero.

Please note that without further context or additional information about the variables and constants used in this expression, it is challenging to provide a specific interpretation of the equation’s meaning or its applications. It appears to be a mathematical equation involving integrals and delta functions, but the precise significance depends on the specific field or context it is used in.

Specific Use and Internet References:

Without specific context or information about the origin of the equation, it’s challenging to provide a direct internet reference or source. However, some components of the expression, such as the use of integrals and delta functions, are common in various mathematical fields, including calculus, differential equations, and theoretical physics. If you can provide more context or the field of study where this equation is relevant, I might be able to provide more specific references or explanations.


What would be interesting to know is, for the images whose generated LaTeX was not accurate, how correct the expression is. What is the confidence level that the expression is correct? In other words, it is at times easier for a human to check simple facts that are a hallucination than it is to check an expression that is a hallucination or partial hallucination.
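One rough way to put a number on "how correct is the expression" when the original LaTeX is available (as it is for a paired dataset) is a token-level edit distance between the generated and the reference LaTeX. The sketch below is only a proxy under that assumption: the tokenizer is a simple regular expression, the function names are made up, and the score says nothing about whether two differently written expressions mean the same thing. It is also not a confidence level, since it needs the ground truth.

```python
import re

# A token is a control word (\frac), a control symbol (\{ or \:), or any
# single non-whitespace character.
_TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|[^\s]")


def tokenize(latex: str) -> list[str]:
    """Split LaTeX into coarse tokens, ignoring whitespace."""
    return _TOKEN_RE.findall(latex)


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two token lists (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1,          # delete token_a
                               current[j - 1] + 1,       # insert token_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(predicted: str, reference: str) -> float:
    """1.0 for identical token sequences, lower as more tokens differ."""
    p, r = tokenize(predicted), tokenize(reference)
    longest = max(len(p), len(r))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(p, r) / longest


if __name__ == "__main__":
    reference = r"\frac{3}{4}t\delta(l)"
    predicted = r"\frac{3}{4}t\sigma(l)"  # one wrong symbol
    print(round(similarity(predicted, reference), 3))
```

A score like this would at least let a reader sort the failures from "one symbol off" to "mostly wrong" before checking them by hand.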


The page did not seem to include the results that returned LaTeX but were invalid; it would have been nice to see that list included. If I missed those, please point them out.

That paper is nearly three years old now. I know InftyReader has had many updates since then and I’m guessing PDF2LaTeX has as well.

I’d be interested to see a newer comparison.

Here’s a more recent paper which discusses some of the issues with PDF processing.

Thank you for sharing that paper, it was an interesting read. It also mentions some of the other issues associated with interpreting PDFs, like extracting figures and tables. I’ll definitely recommend it to anyone who wants to better understand the issues that can come with extracting data from PDFs.

Unfortunately their sources concerning the extraction of mathematical equations are also from 2020, but it is still very interesting, so I’ll share one of the cited papers here:

https://doi.org/10.1111/exsy.12576

I’d be very interested in seeing a newer comparison myself.

I don’t think there’s any failure rate included for the im2latex dataset, but the PDF version of the PDF2LaTeX paper does include, in the appendix, some failed output compared to the original LaTeX source.

From what’s included there, it seems that there are still a few issues with mathematical equations, but it looks far better than InftyReader.

The paper shared by elmstedt has some interesting paragraphs about how the context needed to understand said equations is almost always included within the text.

In that case, misinterpreting a Greek letter might not be an issue as long as it’s consistently wrong, i.e. it doesn’t matter if my \Theta is interpreted as \Omicron as long as the same letter appears in both the equation and the accompanying explanation.
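The "consistently wrong" condition can be checked mechanically when a reference is available. The sketch below, with a hypothetical function name and hand-written token lists, tests whether the recognized tokens differ from the reference only by a one-to-one renaming of symbols. It assumes both sequences are already tokenized and aligned position by position, which real OCR output will not always be.

```python
from typing import Optional, Sequence


def consistent_substitution(
    reference: Sequence[str], recognized: Sequence[str]
) -> Optional[dict]:
    """Return the renaming {reference_symbol: recognized_symbol} if the
    recognized tokens are a consistent one-to-one renaming of the
    reference tokens, otherwise None.

    One-to-one matters: if two different symbols were both read as the
    same symbol, the distinction between them would be lost.
    """
    if len(reference) != len(recognized):
        return None
    forward, backward = {}, {}
    for ref, rec in zip(reference, recognized):
        # The same reference symbol must always map to the same output...
        if forward.setdefault(ref, rec) != rec:
            return None
        # ...and no two reference symbols may share an output.
        if backward.setdefault(rec, ref) != ref:
            return None
    # Report only the symbols that actually changed.
    return {ref: rec for ref, rec in forward.items() if ref != rec}


if __name__ == "__main__":
    reference = [r"\Theta", "=", "2", r"\pi", r"\Theta"]
    always_wrong = [r"\Omicron", "=", "2", r"\pi", r"\Omicron"]
    sometimes_wrong = [r"\Omicron", "=", "2", r"\pi", r"\Theta"]
    print(consistent_substitution(reference, always_wrong))
    print(consistent_substitution(reference, sometimes_wrong))
```

For the check to match the point made above, the token lists would need to cover both the equation and the surrounding explanation, so that a renamed letter is renamed everywhere it is discussed.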

“Nougat: Neural Optical Understanding for Academic Documents” by Lukas Blecher, Guillem Cucurull, Thomas Scialom and Robert Stojnic (pdf)

I knew I recognized the name of the first author!

I do think this is exciting work, but they have some hurdles to clear with the repetition issue and some optimization to do to speed up the process.

I wish they had included comparisons to other work like InftyReader and Mathpix in their results section.

While it is only one example, it is more than nothing.

https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/Nougat.2C.20academic.20OCR.20model.20from.20Meta.20AI/near/387714586

Hello everyone! I hope my message finds you well.

I am interested in developing a personal chatbot using the latest LangChain framework. The chatbot will be designed to read and analyze academic papers. Although I found several online tutorials on creating chatbots with LangChain, none of them mentioned the limitations of LLMs in interpreting technical expressions such as math and type rules. It seems that the common Python package used for reading PDF files in these projects is PyPDF2.

I personally tested ChatGPT 4’s file uploader and asked it to explain the mathematical model to me. However, the results were unsatisfactory, which led me to seek assistance on this discussion board.

Please forgive my limited programming background. I am searching for a solution mentioned in this discussion thread, or something else that I can integrate into the LangChain framework. Maybe Nougat? Or the LaTeX OCR project? I’m not sure. Thank you all for your help!

By the way, I have noticed that there are some AI platforms that can assist with reading scientific papers, but unfortunately they are not open source. Some platforms, such as Scite and SCISPACE, appear to be able to extract technical expressions. However, I am not inclined to trust them as they are not open source. I would like to hear your thoughts on them, too.

Best,

Mathpix is the best at this at present. Nothing else comes close.
