How do I scrape an HTML page and pass its content to an LLM without exceeding the token limit? How should I structure the payload?
Simple solutions can include:
Parsing the HTML
Stripping out tags or attributes you don't need
Extracting just the text and passing that
Chunking pages
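As a rough illustration of the steps above, here is a minimal sketch using only the Python standard library. The names `TextExtractor`, `html_to_text`, and `chunk_words` are made up for this example, and the word count is only a crude stand-in for tokens: a real tokenizer will give different numbers, so leave some headroom below the model's actual limit.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of non-content tags."""

    # Tags whose contents are never useful to the model (assumed list).
    SKIP_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            cleaned = " ".join(data.split())  # collapse runs of whitespace
            if cleaned:
                self._parts.append(cleaned)

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Parse the HTML and return only its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def chunk_words(text, max_words=500):
    """Split text into chunks of at most max_words words each."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    page = "<html><head><script>var x = 1;</script></head>" \
           "<body><h1>Title</h1><p>Hello   world</p></body></html>"
    for chunk in chunk_words(html_to_text(page), max_words=2):
        print(chunk)
```

Each chunk can then be sent as its own request, or summarized first and the summaries combined, depending on what you need the model to do with the page.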
You might look into the Python package newspaper3k. It strips an HTML page down to its main content.
https://www.reddit.com/r/Python/comments/1bmtdy0/i_forked_newspaper3k_fixed_bugs_and_improved_its/
newspaper3k is how OpenAI processed much of the scraped web data that went into training their AI models.