Extracting text from formatted documents

Our internal knowledge base consists of formatted content: it contains images as well as content structured as tables, for example troubleshooting guides.

What is the recommended way to read and interpret this content so that it can be understood by GPT?

Currently I have to print the HTML content as a PDF, extract the text from it manually, and then do a considerable amount of cleanup, because formatted content such as tables doesn't translate well to plain text.

Are there any tools that can automatically convert formatted content like this into a format that GPT can consume directly? Also, is it possible to handle image content, such as troubleshooting screenshots, and associate each image with its content block?
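
To make the goal concrete, here is a minimal sketch of the kind of conversion I mean, working from the HTML directly instead of going through PDF. It uses only Python's standard-library `html.parser`. The class name `KBTextExtractor`, the function `html_to_text`, and the `[image: ...]` placeholder format are my own assumptions for illustration, not an existing tool or a recommended approach, and it does not handle nested tables or merged cells.

```python
from html.parser import HTMLParser


class KBTextExtractor(HTMLParser):
    """Hypothetical helper: flattens HTML to text lines, keeping table rows
    as pipe-delimited lines and images as inline placeholders."""

    BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.lines = []   # finished output lines
        self.buf = []     # text fragments of the current block or table cell
        self.row = None   # cells of the current table row, None outside a row

    def _flush(self):
        # Join the buffered fragments and collapse runs of whitespace.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        return text

    def _end_block(self):
        text = self._flush()
        if text:
            self.lines.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            alt = a.get("alt") or "no description"
            src = a.get("src") or ""
            # Keep the image next to the text it appears in (assumed format).
            self.buf.append(f" [image: {alt} ({src})] ")
        elif tag == "tr":
            self._end_block()
            self.row = []
        elif tag in ("td", "th"):
            self.buf = []
        elif tag in self.BLOCK_TAGS:
            if self.row is None:
                self._end_block()
            else:
                self.buf.append(" ")  # paragraph inside a cell: keep words apart

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.row is not None:
            self.row.append(self._flush())
        elif tag == "tr" and self.row is not None:
            self.lines.append("| " + " | ".join(self.row) + " |")
            self.row = None
        elif tag in self.BLOCK_TAGS and self.row is None:
            self._end_block()

    def handle_data(self, data):
        self.buf.append(data)


def html_to_text(html: str) -> str:
    parser = KBTextExtractor()
    parser.feed(html)
    parser.close()
    parser._end_block()  # emit any trailing text outside a block element
    return "\n".join(parser.lines)


sample = (
    '<h2>Printer offline</h2>'
    '<p>Check the cable.<img src="cable.png" alt="Cable port"></p>'
    '<table><tr><th>Symptom</th><th>Fix</th></tr>'
    '<tr><td>No power</td><td>Replace fuse</td></tr></table>'
)
print(html_to_text(sample))
```

With this sample, each table row stays on one line with its cells separated by `|`, and the screenshot reference stays inside the paragraph it belongs to. Whether that is a good representation for GPT is exactly what I am unsure about.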