How to extract technical expressions from PDFs so that they can be understood by AI?

There are many solutions for extracting text from PDFs. What is not available, as far as I know, is a way to extract technical expressions such as math, proofs, and type rules from a PDF so that one could chat with the PDF and the AI would also understand those expressions.

This question is an offshoot of another question that only accepts specific working answers. This variation allows free-form discussion, but replies must be on-topic.


Personal notes

These notes are kept here so that others can find them and so that I can find them too. No, I don’t lose notes; I just have thousands of them from a few decades, and searching them is not always easy.


Terence Tao just posted a blog entry about using GPT-4.

Embracing change and resetting expectations – Microsoft Unlocked


Lean forum topic about Semantic search for Mathematics

https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/Semantic.20Search.20for.20Mathematics


“Evaluating Token-Level and Passage-Level Dense Retrieval Models for
Math Information Retrieval” (pdf) by Wei Zhong, Jheng-Hong Yang, Yuqing Xie, and Jimmy Lin


Approach Zero
A math-aware search engine

Approach Zero


A much lighter-weight arxiv-sanity from-scratch re-write.

arxiv-sanity

GitHub - karpathy/arxiv-sanity-lite: arxiv-sanity lite: tag arxiv papers of interest get recommendations of similar papers in a nice UI using SVMs over tfidf feature vectors based on paper abstracts.


“Semantic Search in Millions of Equations” (pdf) by Lukas Pfahler and Katharina Morik

Semantic Search in Millions of Equations | Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining


GitHub - stacks/stacks-project: Repository for the Stacks Project


To extract part of a LaTeX document

extract – Extract parts of a document and write to another document

CTAN: Package extract

This worked when I did a proof of concept on a small document.


https://www.mathematik.uni-marburg.de/~seba/publications/sle10.pdf


Great question,
I’ve puzzled with this for a long time myself.

I would always advise you to try to obtain the source files for the document (usually a .tex file).
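When the .tex source is available, the technical expressions can be pulled out directly instead of being recovered from rendered pages. A minimal sketch, assuming the expression of interest sits in a non-nested environment; the helper name `extract_environment` and the sample text are made up for illustration:

```python
import re

def extract_environment(tex, name):
    """Return the bodies of every non-nested \\begin{name}...\\end{name} block."""
    env = re.escape(name)  # also handles starred names such as "figure*"
    pattern = re.compile(
        r"\\begin\{" + env + r"\}(.*?)\\end\{" + env + r"\}",
        re.DOTALL,
    )
    return [m.group(1).strip() for m in pattern.finditer(tex)]

# Illustrative sample, not taken from any real paper.
SAMPLE = r"""
\section{Example}
Some prose that an ordinary text extractor would handle fine.
\begin{equation}
\frac{3x-2xy+c}{y-1}
\end{equation}
"""

print(extract_environment(SAMPLE, "equation"))
```

A regular expression like this does not handle nested environments of the same name or environments hidden behind custom macros, so it is only a starting point.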

If that isn’t possible, you can use an image-to-text AI to convert it to LaTeX, but this isn’t always reliable. The only service I’ve tried that does this reliably is Mathpix.

Mathpix is a paid service, but it has an API that accepts PDFs and converts them to LaTeX, math included.

I would be very interested if anyone knows of an open source equivalent that does the same.


I did a bit of searching, and found this project:

https://lukas-blecher.github.io/LaTeX-OCR/

I haven’t tested it myself, but it looks promising


Microsoft’s Math Solver app can do formula OCR on Android and iOS, but I don’t know if it can be used as part of a toolchain.


While searching for Microsoft’s Math Solver I found this

but it does not expose the converted image for use with an API call. Can you connect the dots if I missed something?


Side note: Your private message response is still the closest I have seen to an answer for my other question. Since you noted it, you should get the credit in this post by adding it as a reply.


Took me a bit to find the code. For others seeking the code for LaTeX-OCR


Sure, the link is

which is an OCR-to-LaTeX conversion system with a personal-use option.

On the Microsoft one, I wondered if perhaps the AI/OCR system used in that app was one of the subcomponents offered by the Azure ML suite.


I know you’ll be after open source, but ABBYY FineReader is great in our experience for a variety of tough OCR extraction problems and is easy to integrate into your workflows.

https://pdf.abbyy.com/

Took a look at the ABBYY FineReader site. While it can convert a PDF to text, I did not see that it could convert images of technical expressions into a format that an AI can understand.

Example of a technical expression, in this specific case typing rules

Did I miss something on the site?


I’m sorry, but we have no experience extracting typing rules with ABBYY; I would suggest contacting them to confirm. It is very good at extracting from image PDFs.

Here’s another link for the inf-project; it seems to be paid as well.

Yes, typing rules are used by an exclusive group.

Here is another example of a technical expression more familiar to many.

[Image: a rendered fraction]

(ref)

While that is an image of a fraction, the underlying representation is MathML:

<mfrac>
  <mrow>
    <mn>3</mn>
    <mi>x</mi>
    <mo>−<!-- − --></mo>
    <mn>2</mn>
    <mi>x</mi>
    <mi>y</mi>
    <mo>+</mo>
    <mi>c</mi>
  </mrow>
  <mrow>
    <mi>y</mi>
    <mo>−<!-- − --></mo>
    <mn>1</mn>
  </mrow>
</mfrac>

which has enough metadata such that an AI should be able to understand that it is a fraction.
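To make concrete how much structure the markup carries, here is a minimal sketch that walks the same fraction with Python’s standard XML parser and prints it as an infix expression. The function name `to_infix` is made up for illustration, and only the four element types used above (`mfrac`, `mrow`, `mn`/`mi`, `mo`) are handled:

```python
import xml.etree.ElementTree as ET

# The fraction from the post; &#x2212; is the Unicode minus sign.
MATHML = """
<mfrac>
  <mrow>
    <mn>3</mn><mi>x</mi><mo>&#x2212;</mo><mn>2</mn><mi>x</mi><mi>y</mi><mo>+</mo><mi>c</mi>
  </mrow>
  <mrow>
    <mi>y</mi><mo>&#x2212;</mo><mn>1</mn>
  </mrow>
</mfrac>
"""

def to_infix(node):
    """Render a tiny subset of MathML as a plain-text infix expression."""
    if node.tag == "mfrac":
        numerator, denominator = list(node)  # mfrac has exactly two children
        return f"({to_infix(numerator)}) / ({to_infix(denominator)})"
    if node.tag == "mrow":
        return "".join(to_infix(child) for child in node)
    # Leaf elements: normalize the Unicode minus to an ASCII hyphen.
    text = "".join(node.itertext()).strip().replace("\u2212", "-")
    if node.tag == "mo":
        return f" {text} "  # pad operators with spaces
    return text  # mn (number) or mi (identifier)

print(to_infix(ET.fromstring(MATHML)))  # prints (3x - 2xy + c) / (y - 1)
```

The numerator and denominator never have to be guessed from layout, which is exactly the information that is lost once the fraction is only an image.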


Passing only the MathML fraction above to ChatGPT (May 24 Version) shows that it understands it.

Prompt


<mfrac>
  <mrow>
    <mn>3</mn>
    <mi>x</mi>
    <mo>−<!-- − --></mo>
    <mn>2</mn>
    <mi>x</mi>
    <mi>y</mi>
    <mo>+</mo>
    <mi>c</mi>
  </mrow>
  <mrow>
    <mi>y</mi>
    <mo>−<!-- − --></mo>
    <mn>1</mn>
  </mrow>
</mfrac>

Completion

The expression you provided is a fraction. Let’s break it down and simplify it step by step.

The numerator of the fraction is: 3x - 2xy + c
The denominator of the fraction is: y - 1

So the fraction can be written as:

(3x - 2xy + c) / (y - 1)

Since there is no specific instruction regarding what you would like to do with this expression, I assume you want to simplify it or perform some operations. Please let me know what specific operation or simplification you would like me to assist you with.


Eric’s example in LaTeX:

\frac{3x-2xy+c}{y-1}


Found this fairly interesting bit of OpenCV code

I’m not convinced it will be able to cope with complex notation, but I’m throwing it in here as a starting point for someone else.

It was interesting waking up to this.

I’ve been looking at this very issue over the past few months, and I’ve tried nearly all of the solutions listed here.

My conclusion is that we’re just not there yet. It’s pretty close in many ways, but there is no single solution that is a one-stop-shop for reading technical PDFs.

Mathpix is the best for extracting information from equation and inline environments, but it can be extremely costly to process a large number of pages, and it doesn’t handle other environments (such as algorithm) particularly well.

Even then, you still need to review the work that comes out of Mathpix because it’s only about 90% correct (depending on complexity of course).
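Part of that review can be narrowed down with cheap sanity checks before a human reads the output. A minimal sketch, assuming the OCR result is available as a LaTeX string; the helper `unbalanced_braces` is hypothetical, is not part of Mathpix or any other tool mentioned here, and catches only one class of error:

```python
def unbalanced_braces(latex):
    """Return True if the braces in a LaTeX string do not pair up."""
    depth = 0
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \}
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # a closing brace with no opener
                return True
        i += 1
    return depth != 0

print(unbalanced_braces(r"\frac{3x-2xy+c}{y-1}"))  # prints False
print(unbalanced_braces(r"\frac{3x-2xy+c}{y-1"))   # prints True
```

Output that passes can still be wrong (a misread symbol keeps the braces balanced), so this only helps decide which pages to look at first.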

I suspect we will see a big push in the open-source space over the next year or two as more people want to feed technical papers into LLMs.

I emailed sciaccess a few months ago to see if they offered an academic discount, and never got a reply.

Didn’t inspire confidence.

Good point, and one that should be seriously considered when trying to understand technical papers.


Also, for others who did not notice the jobs posting on arXiv: it seems they are expanding quite a bit.

Here is an example of a PDF with typing rules available on arXiv.

“GPC: A Pattern Calculus for Property Graphs” by Nadime Francis, Amélie Gheerbrant, Paolo Guagliardo, Leonid Libkin, Victor Marsault, Wim Martens, Filip Murlak, Liat Peterfreund, Alexandra Rogova, Domagoj Vrgoč

It is available in a few different formats.

The typing rules being demonstrated are in Figure 2.

[Image: Figure 2, the typing rules]

Text source of typing rules

This is from file 2210.16580 which is downloaded by clicking Download source on the Format selector page

\begin{figure*}\centering

\newcommand{\custominference}[2]{$\displaystyle\inference{#1}{#2}$}

%\custominference{}{\nodelit{x\,[:\ell]} \vdash x:\tnode} 
\custominference{}{\nodelit{x} \vdash x:\tnode}
\hfil
\custominference{}{\nodelit{x:\ell} \vdash x:\tnode}
\hfil
%\custominference {}{\arclit{x\, [:\ell]}\  \vdash x:\tedge} 
\custominference {}{\arclit{x}\  \vdash x:\tedge} 
\hfil
\custominference {}{\arclit{x :\ell}\  \vdash x:\tedge} 
\hfil
\custominference{ x \notin \var(\pat)}{x=\restrictor\ \pat \vdash x:\tpath}
%\custominference{\pat \vdash x:\tau \quad p \notin \var(\pat)}{p=\pat \vdash p:\tpath,\ x:\tau}


\bigskip

\custominference {\pat\vdash z:\tau}{\pat\quantlit{n..m} \vdash z:\tlist(\tau)} 
\hfil
\custominference {\pat\vdash z:\tau}{\restrictor \pat \vdash z:\tau}
%\hfil
%\custominference{\pat \vdash z:\tau \quad z\neq x}{x=\pat \vdash  z:\tau}
\hfil
\custominference{\pat \vdash z:\tau \quad z\neq x }{ x= \restrictor \pat \vdash  z:\tau}

\bigskip

\custominference {\pat\vdash x:\tau \quad \tau\in\set{\tnode,\tedge}}{\pat \vdash x.a=c:\tbool} \hfil
\custominference {\pat\vdash x:\tau\quad \pat\vdash y:\tau'\quad\tau,\tau'\in\set{\tnode,\tedge}}{\pat \vdash x.a=y.b:\tbool} 

\bigskip

\custominference {\pat\vdash \theta:\tbool \quad \pat\vdash\theta':\tbool}{\pat \vdash \theta\wedge\theta':\tbool} 
\hfil
\custominference {\pat\vdash \theta:\tbool \quad \pat\vdash\theta':\tbool}{\pat \vdash \theta\vee\theta':\tbool} 
\hfil
\custominference {\pat\vdash \theta:\tbool}{\pat \vdash \neg\theta:\tbool} 
\hfil
\custominference {\pat\vdash \theta:\tbool \quad \pat\vdash z:\tau}{\pat\condlit{\theta} \vdash z:\tau} 

\bigskip

\custominference {\pat_1\vdash z:\tau \quad \pat_2\vdash z:\tau}{\pat_1 + \pat_2 \vdash z:\tau}
\hfil
\custominference {\pat_1\vdash z:\tau \quad \pat_2\vdash z:\tmaybe(\tau)}{\pat_1 + \pat_2 \vdash z:\tmaybe(\tau)}
\hfil
\custominference {\pat_1\vdash z:\tmaybe(\tau) \quad \pat_2\vdash z:\tau}{\pat_1 + \pat_2 \vdash z:\tmaybe(\tau)}

\bigskip

\custominference {\pat_1\vdash z:\tau \quad  z \notin \var(\pat_2) }{\pat_1 + \pat_2 \vdash 
z: \tau?} 
\hfil
\custominference {\pat_2\vdash z:\tau \quad z \notin \var(\pat_1) }{\pat_1 + \pat_2 \vdash z: \tau?}

\bigskip

\custominference {\pat_1\vdash z:\tau \quad \pat_2\vdash z:\tau \quad \tau\in\set{\tnode,\tedge}}{\pat_1 \, \pat_2 \vdash z:\tau} 
\hfil
\custominference {\pat_1\vdash z:\tau \quad z \not\in\var(\pat_2)}{\pat_1 \, \pat_2 \vdash z:\tau}
\hfil
\custominference {\pat_2\vdash z:\tau \quad z \not\in\var(\pat_1)}{\pat_1 \, \pat_2 \vdash z:\tau} 

\bigskip

\custominference {Q_1\vdash z:\tau \quad Q_2\vdash z:\tau \quad \tau\in\set{\tnode,\tedge}}{Q_1 , Q_2 \vdash z:\tau}
\hfil
\custominference{Q_1\vdash z:\tau \quad z\not\in\var(Q_2) }{Q_1, Q_2 \vdash z:\tau}
\hfil
\custominference{Q_2\vdash z:\tau \quad z\not\in \var(Q_1) }{Q_1, Q_2 \vdash z:\tau}

\caption{Typing rules for the \gpml type system.}
\label{fig:type-system}
\end{figure*}

Does ChatGPT understand the TeX?

ChatGPT May 24 Version

Prompt

\begin{figure*}\centering

\newcommand{\custominference}[2]{$\displaystyle\inference{#1}{#2}$}

%\custominference{}{\nodelit{x\,[:\ell]} \vdash x:\tnode}
\custominference{}{\nodelit{x} \vdash x:\tnode}
\hfil
\custominference{}{\nodelit{x:\ell} \vdash x:\tnode}
\hfil
%\custominference {}{\arclit{x\, [:\ell]}\ \vdash x:\tedge}
\custominference {}{\arclit{x}\ \vdash x:\tedge}
\hfil
\custominference {}{\arclit{x :\ell}\ \vdash x:\tedge}
\hfil
\custominference{ x \notin \var(\pat)}{x=\restrictor\ \pat \vdash x:\tpath}
%\custominference{\pat \vdash x:\tau \quad p \notin \var(\pat)}{p=\pat \vdash p:\tpath,\ x:\tau}

\bigskip

\custominference {\pat\vdash z:\tau}{\pat\quantlit{n..m} \vdash z:\tlist(\tau)}
\hfil
\custominference {\pat\vdash z:\tau}{\restrictor \pat \vdash z:\tau}
%\hfil
%\custominference{\pat \vdash z:\tau \quad z\neq x}{x=\pat \vdash z:\tau}
\hfil
\custominference{\pat \vdash z:\tau \quad z\neq x }{ x= \restrictor \pat \vdash z:\tau}

\bigskip

\custominference {\pat\vdash x:\tau \quad \tau\in\set{\tnode,\tedge}}{\pat \vdash x.a=c:\tbool} \hfil
\custominference {\pat\vdash x:\tau\quad \pat\vdash y:\tau'\quad\tau,\tau'\in\set{\tnode,\tedge}}{\pat \vdash x.a=y.b:\tbool}

\bigskip

\custominference {\pat\vdash \theta:\tbool \quad \pat\vdash\theta':\tbool}{\pat \vdash \theta\wedge\theta':\tbool}
\hfil
\custominference {\pat\vdash \theta:\tbool \quad \pat\vdash\theta':\tbool}{\pat \vdash \theta\vee\theta':\tbool}
\hfil
\custominference {\pat\vdash \theta:\tbool}{\pat \vdash \neg\theta:\tbool}
\hfil
\custominference {\pat\vdash \theta:\tbool \quad \pat\vdash z:\tau}{\pat\condlit{\theta} \vdash z:\tau}

\bigskip

\custominference {\pat_1\vdash z:\tau \quad \pat_2\vdash z:\tau}{\pat_1 + \pat_2 \vdash z:\tau}
\hfil
\custominference {\pat_1\vdash z:\tau \quad \pat_2\vdash z:\tmaybe(\tau)}{\pat_1 + \pat_2 \vdash z:\tmaybe(\tau)}
\hfil
\custominference {\pat_1\vdash z:\tmaybe(\tau) \quad \pat_2\vdash z:\tau}{\pat_1 + \pat_2 \vdash z:\tmaybe(\tau)}

\bigskip

\custominference {\pat_1\vdash z:\tau \quad z \notin \var(\pat_2) }{\pat_1 + \pat_2 \vdash
z: \tau?}
\hfil
\custominference {\pat_2\vdash z:\tau \quad z \notin \var(\pat_1) }{\pat_1 + \pat_2 \vdash z: \tau?}

\bigskip

\custominference {\pat_1\vdash z:\tau \quad \pat_2\vdash z:\tau \quad \tau\in\set{\tnode,\tedge}}{\pat_1 \, \pat_2 \vdash z:\tau}
\hfil
\custominference {\pat_1\vdash z:\tau \quad z \not\in\var(\pat_2)}{\pat_1 \, \pat_2 \vdash z:\tau}
\hfil
\custominference {\pat_2\vdash z:\tau \quad z \not\in\var(\pat_1)}{\pat_1 \, \pat_2 \vdash z:\tau}

\bigskip

\custominference {Q_1\vdash z:\tau \quad Q_2\vdash z:\tau \quad \tau\in\set{\tnode,\tedge}}{Q_1 , Q_2 \vdash z:\tau}
\hfil
\custominference{Q_1\vdash z:\tau \quad z\not\in\var(Q_2) }{Q_1, Q_2 \vdash z:\tau}
\hfil
\custominference{Q_2\vdash z:\tau \quad z\not\in \var(Q_1) }{Q_1, Q_2 \vdash z:\tau}

\caption{Typing rules for the \gpml type system.}
\label{fig:type-system}
\end{figure*}

Completion

The figure you provided presents a set of typing rules for the \gpml type system. These rules define how different patterns and expressions are typed in the system. Let’s go through each rule:

  1. The first two rules state that a node or an arc literal with or without a label is typed as a node or an edge, respectively.

  2. The third rule states that if a variable x is not bound in the pattern (\pat), then x can be restricted to a path and typed as a path (\tpath).

  3. The fourth rule states that if a pattern \pat is typed as z of type \tau, then the pattern \pat\quantlit{n..m} is typed as z of type List(\tau).

  4. The fifth and sixth rules deal with restricting patterns. If a pattern \pat is typed as z of type \tau, then the pattern \restrictor\ \pat is also typed as z of type \tau.

  5. The seventh rule states that if a pattern \pat is typed as x of type \tau, where \tau is either a node or an edge, then the expression x.a=c is typed as a Boolean (\tbool).

  6. The eighth rule states that if a pattern \pat is typed as x of type \tau and another pattern y is typed as y of type \tau', where both \tau and \tau' are either nodes or edges, then the expression x.a=y.b is typed as a Boolean (\tbool).

  7. The ninth, tenth, eleventh, and twelfth rules deal with Boolean operations. They state that if patterns or expressions \pat and \theta are typed as Boolean values, then the conjunction (\wedge), disjunction (\vee), negation (\neg), and conditional (\condlit{\theta}) expressions are also typed as Booleans.

  8. The thirteenth rule states that if patterns \pat_1 and \pat_2 are both typed as z of type \tau, then the pattern \pat_1 + \pat_2 is also typed as z of type \tau.

  9. The fourteenth, fifteenth, and sixteenth rules deal with patterns that include optional values. They specify how patterns with optional values are typed based on the presence or absence of the value.

  10. The seventeenth, eighteenth, and nineteenth rules deal with concatenating patterns and expressions. They specify the typing for the concatenation operator (,). If patterns or expressions \pat_1 and \pat_2 are typed as z of type \tau, then \pat_1 \, \pat_2 is also typed as z of type \tau. Additionally, if the variable z is not bound in either \pat_1 or \pat_2, then the typing remains the same.

  11. The twentieth, twenty-first, and twenty-second rules deal with concatenating queries (Q_1 and Q_2). They specify the typing for the concatenation operator (,). If queries Q_1 and Q_2 are both typed as z of type \tau, then Q_1 , Q_2 is also typed as z of type \tau. Additionally, if the variable z is not bound in either Q_1 or Q_2, then the typing remains the same.

These rules define the typing behavior of various patterns and expressions in the \gpml type system.


Personal note:

The completion says “The first two rules state”, but it really should be “The first four rules state”. The first rule is for a node without a label, the second is for a node with a label, the third is for an edge without a label, and the fourth is for an edge with a label. Thus the numbering for the remainder of the rules is off, but if you have a bit of understanding of how to read typing rules it is easy to match each description with its typing rule.
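A rule count like ChatGPT’s can be checked mechanically against the TeX source by counting the \custominference calls that are not commented out. A minimal sketch on a short excerpt of the figure; the function name `count_rules` is made up for illustration, and the comment handling is deliberately simple (everything after an unescaped % is dropped):

```python
import re

def count_rules(tex, macro="custominference"):
    """Count uncommented invocations of a rule macro in TeX source."""
    # Require an opening brace after the macro name so that the
    # \newcommand{\custominference} definition itself is not counted.
    call = re.compile(r"\\" + re.escape(macro) + r"\s*\{")
    total = 0
    for line in tex.splitlines():
        code = re.sub(r"(?<!\\)%.*", "", line)  # strip TeX comments
        total += len(call.findall(code))
    return total

# Excerpt from the start of the figure source shown above.
SAMPLE = r"""
\newcommand{\custominference}[2]{$\displaystyle\inference{#1}{#2}$}
%\custominference{}{\nodelit{x\,[:\ell]} \vdash x:\tnode}
\custominference{}{\nodelit{x} \vdash x:\tnode}
\hfil
\custominference {}{\arclit{x}\  \vdash x:\tedge}
"""

print(count_rules(SAMPLE))  # prints 2
```

Running the same function over the whole figure gives the number of rules to compare against the numbering in the completion.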

PostScript version of second typing rule

This is from file 2210.16580v1.ps.gz which is downloaded by clicking Download PostScript on the Format selector page

727 190 654 5 v 290 w FN(\271)q FH(G)27 b FF(:)21 b FH(\022)6
b FN(\272)25 b(`)18 b FH(G)28 b FF(:)21 b Fw(No)r(de)p

Does ChatGPT understand PostScript?

Note: Only the PostScript for the second typing rule is used, as giving ChatGPT all of the PostScript for the typing rules would exceed the token limit.

Prompt

The following is a portion of PostScript, explain what you can.

 727 190 654 5 v 290 w FN(\271)q FH(G)27 b FF(:)21 b FH(\022)6
b FN(\272)25 b(`)18 b FH(G)28 b FF(:)21 b Fw(No)r(de)p

Completion

The portion of code you provided appears to be written in the PostScript programming language. PostScript is a page description language commonly used for printing and graphics.

Let’s analyze the code line by line:

727 190 654 5 v 290 w FN(\271)q FH(G)27 b FF(:)21 b FH(\022)6 b FN(\272)25 b(`)18 b FH(G)28 b FF(:)21 b Fw(No)r(de)p
  1. 727 190 654 5 v: This line consists of four numeric values (727, 190, 654, and 5) followed by the operator v. These values are likely coordinates or parameters related to positioning or transformations. Without more context, it’s difficult to determine their exact purpose.

  2. 290 w: This line contains a numeric value (290) followed by the operator w. Again, without more context, it’s unclear what this line represents.

  3. FN(\271)q FH(G)27 b FF(:)21 b FH(\022)6 b FN(\272)25 b(`)18 b FH(G)28 b FF(:)21 b Fw(No)r(de)p: This is a series of instructions involving various operators and string literals. Here’s a breakdown of the individual components:

    • FN(\271): This is likely a font-related instruction, specifying a particular font with a parameter \271.
    • q: This operator saves the current graphics state.
    • FH(G)27 b: This could be another font-related instruction specifying a different font (G) with a parameter 27. The b operator might be related to setting the font weight or style.
    • FF(:)21 b: This is likely a command to print or display a colon (:) using the current font. The 21 b might be related to font size or position.
    • FH(\022)6 b: This could be another font-related instruction, specifying a different font with a non-printable character \022. The 6 b might be related to font size or position.
    • FN(\272)25 b: This could be another font-related instruction specifying a different font with a parameter \272. The 25 b might be related to font size or position.
    • (`)18 b: This is likely a command to print or display an opening parenthesis (() using the current font. The 18 b might be related to font size or position.
    • FH(G)28 b: This could be another font-related instruction specifying a different font (G) with a parameter 28. The b operator might be related to setting the font weight or style.
    • FF(:)21 b: Similar to the previous occurrence, this is likely a command to print or display a colon (:) using the current font. The 21 b might be related to font size or position.
    • Fw(No)r(de)p: This appears to be a command to print or display the string “No” using a specific font (w). The subsequent characters r(de)p are not immediately clear without additional context.

Overall, without further context or knowledge of the surrounding PostScript code, it’s challenging to provide a definitive interpretation or explanation of the code.


Personal note.

Having parsed PostScript by hand, I know that Fw(No)r(de) is actually the word Node in the second typing rule. While I did not expect ChatGPT to understand everything about the PostScript snippet, I was surprised that it did not recognize that as the word Node.
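The reading above can be reproduced with a few lines of code. A minimal sketch, assuming that each parenthesized string in the snippet is text to be shown and that the operators between strings only move the current position; the names `PS_STRING` and `ps_strings` are made up for illustration, and nested parentheses inside strings are not handled:

```python
import re

# One PostScript string literal: "(" ... ")", allowing backslash escapes
# such as \271 but not nested parentheses.
PS_STRING = re.compile(r"\(((?:\\.|[^()\\])*)\)")

# The PostScript for the second typing rule, joined onto one logical line.
SNIPPET = (
    r"727 190 654 5 v 290 w FN(\271)q FH(G)27 b FF(:)21 b FH(\022)6 "
    r"b FN(\272)25 b(`)18 b FH(G)28 b FF(:)21 b Fw(No)r(de)p"
)

def ps_strings(ps):
    """Return the raw contents of every string literal, in order."""
    return PS_STRING.findall(ps)

fragments = ps_strings(SNIPPET)
print(fragments)
print("".join(fragments[-2:]))  # prints Node
```

The word is split into (No) and (de) with an operator in between, so it never appears as a contiguous string in the file, which may be part of why it was missed.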


For those of us without experience with typing systems, can you tell us how well it understood the TeX code?

As with most responses by ChatGPT, it depends upon your level of knowledge. For a beginner it does very well, but an expert would probably find more wrong than correct. But as we know, sometimes just getting over the initial hump of learning, going from nothing to having some information as demonstrated in the example, will get someone interested enough to keep going.

A much better example to answer your question would be the typing rules for a functional programming language like F# or Haskell, where the typing rules can also be demonstrated with code and the types are included in the function signature.

If I find one I will give it a try, and if it works I will post it as a reply. 🙂


Update on the search.

I spent about an hour looking after posting that. The language with the best published typing rules was ML, but they were too complex for a beginner.

However, typing rules for Lambda Calculus are very easy to understand (ref), but the problem is that many do not understand Lambda Calculus. Typed Lambda Calculus might be better, but then the typing rules are a bit more complex.
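For readers who have not seen them, the three standard rules of the simply typed lambda calculus look like this. This is the usual textbook presentation, not copied from the reference linked above; \Gamma is the typing context, \sigma and \tau are types, and each rule is written as premises over conclusion using plain \frac so that no extra package is needed:

```latex
% Variable: x has type tau if the context says so.
\[ \frac{x : \tau \in \Gamma}{\Gamma \vdash x : \tau} \]

% Abstraction: a function from sigma to tau.
\[ \frac{\Gamma,\, x : \sigma \vdash e : \tau}
        {\Gamma \vdash \lambda x{:}\sigma.\, e : \sigma \to \tau} \]

% Application: the argument type must match the function's domain.
\[ \frac{\Gamma \vdash e_1 : \sigma \to \tau \quad \Gamma \vdash e_2 : \sigma}
        {\Gamma \vdash e_1\, e_2 : \tau} \]
```

They have the same premises-over-conclusion shape as the rules in the figure from the GPC paper, only with far fewer cases.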
