Can Assistant recognise strikethrough text in a PDF?

ek3 · July 19, 2024, 6:53am

Hi,
I’m currently building an Assistant that receives a document (as a PDF) at the start of each new thread and answers questions about it.
This works in general, but I’ve run into a problem: it can’t recognise strikethrough text, so it answers incorrectly or with invalid data.
My best guess is that the assistant uses OCR to read the PDF and it therefore cannot “see” strikethrough text.
Any experiences with this? Or any idea how I can fix this?
Also, is it possible to get it to return the correct page and paragraph for a piece of information? I always get the 1st page and 1st paragraph.

Diet · July 19, 2024, 7:13am

Welcome to the community!

We’ve had a similar discussion in the Google AI forums (link).

The findings were:

- Strikethrough in OCR still seems to be an unsolved issue.
- One (low engineering cost) idea is to use a vision model to convert the document into structured text, and then vectorize that. There are some professional/enterprise OCR tools that may be able to do this too.

One of the issues with strikethrough is that it’s not always obvious what it means, and I wouldn’t trust TE-3L to understand it. So you need to decide whether you want to elide it, or do something more complicated.
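
To make the "elide it, or do something more complicated" choice concrete, here is a minimal sketch of the post-processing step that could sit between the vision model and the vectorizing. It assumes you have prompted the vision model to transcribe struck-out text wrapped in `~~...~~` (that marker, the function names and the `[DELETED: ...]` label are all my own placeholders, not anything the API gives you). Treat it as a starting point, not a tested pipeline.

```python
import re

# Assumption: the vision-model transcription wraps struck-out text in ~~...~~.
STRIKE = re.compile(r"~~(.+?)~~", re.DOTALL)


def elide_strikethrough(text: str) -> str:
    """Option 1: drop struck-out spans entirely before vectorizing."""
    without = STRIKE.sub("", text)
    # Tidy up the gaps the removal leaves behind.
    without = re.sub(r"[ \t]{2,}", " ", without)      # collapse doubled spaces
    without = re.sub(r" +([.,;:])", r"\1", without)   # no space before punctuation
    return without.strip()


def annotate_strikethrough(text: str) -> str:
    """Option 2: keep struck-out spans, but label them explicitly."""
    return STRIKE.sub(lambda m: f"[DELETED: {m.group(1)}]", text)


if __name__ == "__main__":
    sample = "The fee is ~~100 EUR~~ 120 EUR per month."
    print(elide_strikethrough(sample))
    print(annotate_strikethrough(sample))
```

Eliding is the safer default if struck-out text should never show up in an answer. The annotated variant keeps the history, but then the model answering the questions has to be told what the label means, and the struck-out wording still ends up in the embeddings.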
