Hi @helpdeskBMTC and welcome to the forums!
Regarding extracting/referencing page numbers: I actually created this thread a while ago. After numerous experiments, the closest I've gotten is by instructing it to treat each page as an image and use OCR. If the page number is visible, it will extract it, but that number may not align with the page's actual position in your document. For example, your document may have a table of contents, a cover page, or other unnumbered pages, so there may be an offset that is difficult to control.
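If you end up post-processing the results yourself, here is a rough sketch of one way to deal with that offset. It assumes you already have pairs of (physical page index, printed page number read via OCR) from somewhere; the function names and sample data are made up for illustration and are not part of any API.

```python
from collections import Counter

def estimate_offset(observations):
    """Estimate the most common difference between physical index and printed number.

    observations: list of (physical_index, printed_number) pairs, where
    printed_number is None for pages with no visible page number.
    Returns None if no page had a readable number.
    """
    diffs = [phys - printed for phys, printed in observations if printed is not None]
    if not diffs:
        return None
    # Majority vote, so that an occasional OCR misread does not skew the result.
    return Counter(diffs).most_common(1)[0][0]

def printed_to_physical(printed, offset):
    """Map a printed page number to its physical position in the file."""
    return printed + offset

# Hypothetical data: cover and table of contents are unnumbered,
# and the last entry simulates an OCR misread.
observations = [(1, None), (2, None), (3, 1), (4, 2), (5, 3), (6, 14)]
offset = estimate_offset(observations)
physical_page = printed_to_physical(2, offset)
```

This only works when the offset is constant across the document; if the numbering restarts (e.g. roman numerals for front matter), you would need to estimate it per section.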
Regarding extracting article references, footnotes, etc.: this should in principle be fine if you provide it with some examples of what they may look like.
But as @PaulBellow stated, it is difficult to guarantee high accuracy due to how text is parsed and chunked “under the hood”.