Can Assistant recognise strikethrough text in a PDF?

Hi,
I’m currently building an Assistant that receives a document (as a PDF) when a new thread starts and then answers questions about it.
This currently works, but I’ve run into a problem: it can’t recognise strikethrough text, so it answers incorrectly or with invalid data.
My best guess is that the assistant uses OCR to read the PDF and it therefore cannot “see” strikethrough text.
Any experiences with this? Or any idea how I can fix this?
Also, is it possible to get it to return the correct page and paragraph for the information? (I always get the 1st page and 1st paragraph.)

Welcome to the community!

We’ve had a similar discussion in the Google AI forums (link).

The findings were thus:

  1. strikethrough in OCR still seems to be an unsolved issue
  2. one (low engineering cost) idea is to use a vision model to convert the PDF pages into structured text, and then vectorize that. There are some professional/enterprise OCR tools that may be able to do this too.

One of the issues with strikethrough is that it’s not always obvious what it means, and I wouldn’t trust TE-3L to understand it. So you need to decide whether you want to elide it, or do something more complicated.
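
To make the "elide it, or do something more complicated" choice concrete, here is a minimal sketch of the post-processing step that would sit between the vision model and the vectorizer. It assumes you have prompted the vision model to mark struck-through text as `~~like this~~` (that marker convention is my assumption, not something the model does by default), and the function names are hypothetical:

```python
import re

# Assumption: the vision model was asked to wrap struck-through text in ~~...~~.
STRIKE = re.compile(r"~~(.+?)~~", re.DOTALL)


def elide_strikethrough(text: str) -> str:
    """Drop struck-through spans entirely, then tidy leftover whitespace."""
    without = STRIKE.sub("", text)
    # Collapse runs of spaces/tabs left behind by the removed spans.
    without = re.sub(r"[ \t]{2,}", " ", without)
    # Remove a stray space that now sits directly before punctuation.
    without = re.sub(r" +([.,;:])", r"\1", without)
    return without.strip()


def annotate_strikethrough(text: str) -> str:
    """Keep struck-through spans, but label them explicitly as deleted."""
    return STRIKE.sub(r"[DELETED: \1]", text)


if __name__ == "__main__":
    sample = "The fee is ~~100 EUR~~ 120 EUR per month."
    print(elide_strikethrough(sample))
    print(annotate_strikethrough(sample))
```

Eliding is the simpler option if struck-through text always means "no longer valid" in your documents. If the meaning varies, the annotated variant at least keeps the information visible to the answering model instead of silently mixing it with the valid text.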