I’ve created a function that lets the model browse the web for research, but I frequently max out the context window. What’s the best way to preprocess HTML to remove all the noise?

Could you instruct the model to only process the information inside the HTML tags? You might also be able to have it pull information out of the DOM to identify what is worth looking at.

@TheWarden, a couple of possible paths if you’re just interested in the text in the pages.

  1. Strip all tags except the content-bearing ones: p, the h’s, table, lists… What we usually do is transform HTML pages to either Markdown or YAML, removing all styles, comments, non-content tags, etc.

  2. Use a different model to select what to show. As @thinktank says, this could take the form of showing a model the DOM and having it select the elements you want.
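
For the first path, here is a minimal sketch of what the stripping step could look like, using only Python's standard-library `html.parser`. The names `TextExtractor`, `html_to_markdown`, `SKIP_TAGS` and `BLOCK_PREFIX` are made up for this example, and the choice of which tags to drop is an assumption you would tune for your own pages. It outputs rough Markdown-like text, not spec-compliant Markdown, and it does not handle nested blocks (for example a `p` inside an `li`) gracefully.

```python
from html.parser import HTMLParser

# Tags whose entire contents are dropped (assumed to be noise for research use).
SKIP_TAGS = {"script", "style", "noscript", "svg", "nav", "footer",
             "header", "aside", "form", "iframe"}

# Block-level tags we keep, mapped to the prefix put in front of their text.
BLOCK_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "h4": "#### ",
                "h5": "##### ", "h6": "###### ", "p": "", "li": "- ", "tr": ""}


class TextExtractor(HTMLParser):
    """Collects text from kept tags and ignores everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []      # finished output lines
        self.buf = []        # text fragments of the current block
        self.prefix = ""     # prefix for the current block
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def _flush(self):
        # Collapse whitespace and emit the current block if it has any text.
        text = " ".join("".join(self.buf).split())
        if text:
            self.lines.append(self.prefix + text)
        self.buf = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in BLOCK_PREFIX:
            self._flush()
            self.prefix = BLOCK_PREFIX[tag]
        elif tag in ("td", "th"):
            self.buf.append(" | ")  # crude cell separator for table rows

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_PREFIX and not self.skip_depth:
            self._flush()

    def handle_data(self, data):
        if not self.skip_depth:
            self.buf.append(data)
    # Comments are dropped because HTMLParser.handle_comment does nothing by default.


def html_to_markdown(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block tag
    return "\n".join(parser.lines)


if __name__ == "__main__":
    page = ("<html><head><style>p{color:red}</style></head><body>"
            "<nav>Home</nav><h1>Title</h1><p>Hello <b>world</b>.</p>"
            "<ul><li>one</li><li>two</li></ul></body></html>")
    print(html_to_markdown(page))
```

All attributes (classes, inline styles, data attributes) disappear as a side effect, since only text and a handful of tag names are kept. One caveat: an unclosed tag from `SKIP_TAGS` in malformed HTML would swallow the rest of the page, so a more forgiving parser may be worth it on messy sites.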

