Passing html content to gpt4o

How do I scrape and pass the content from a HTML page to an LLM without crossing the token limit? How to structure the payload

Simple solutions can include:

Parsing the HTML

Stripping out specific unneeded properties or tags if unnecessary
Stripping out just text and passing that

Chunking pages

You might investigate the python package newspaper3k. It is for stripping HTML tags down to the contents.

https://www.reddit.com/r/Python/comments/1bmtdy0/i_forked_newspaper3k_fixed_bugs_and_improved_its/

It is how OpenAI processed a lot of their web scrapings that went into training AI models.

1 Like