How to prepare the content of an HTML page for embedding calculation

I would like to know what the most efficient approach is for preparing content for high-quality embedding calculation.

We have an HTML page whose content we would like to use for embedding calculation.

What is the recommended way to prepare content for calculating embeddings:

  1. Keep the whole HTML code and calculate embeddings from it?
  2. Filter out non-semantic HTML code, keeping only the semantic parts (such as headings, HTML5 tags, or structured data)?
  3. Filter out all HTML code so that only plain text remains?

If case 3 is recommended, all tables and lists are filtered out as well - doesn't that disturb the calculation?

Or are embeddings calculated from the content as if from a bag-of-words container, where markup semantics don't play any role?

If your HTML is clean and well structured, you can convert it into Markdown and then split it into chunks if necessary.
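To illustrate, here is a minimal stdlib-only sketch of that idea: convert a few common semantic tags to Markdown, then split the result into chunks on paragraph boundaries. In practice you would likely use a dedicated library; the function names and the 500-character chunk size here are my own illustrative choices.

```python
from html.parser import HTMLParser


class HTMLToMarkdown(HTMLParser):
    """Tiny HTML-to-Markdown converter: handles headings, paragraphs, list items."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # e.g. <h2> becomes "## "
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        self.parts.append(data.strip())


def html_to_markdown(html: str) -> str:
    parser = HTMLToMarkdown()
    parser.feed(html)
    return "".join(parser.parts).strip()


def chunk_markdown(md: str, max_chars: int = 500) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for para in md.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

Chunking on paragraph or heading boundaries keeps each embedded chunk topically coherent, which tends to give better retrieval than cutting at a fixed character offset.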

My strategy is to store two versions of the content:

  1. plain text - with non-essential characters removed (e.g. hashes, multiple newlines, etc.) - for calculating embeddings and determining distances during retrieval;
  2. markdown - for inclusion in the context to provide well organised content.
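As a rough sketch of that two-version strategy (the field names and cleanup regexes are my own, not a prescribed schema): strip the Markdown markers and collapse whitespace for the embedding text, while keeping the original Markdown for the context window.

```python
import re


def prepare_versions(markdown: str) -> dict:
    """Build both stored versions: cleaned plain text for embedding,
    untouched Markdown for inclusion in the prompt context."""
    # Strip leading Markdown markers (hashes, list bullets, blockquotes).
    plain = re.sub(r"(?m)^[#>*+\-\s]+", "", markdown)
    # Collapse runs of newlines and repeated spaces/tabs.
    plain = re.sub(r"\n{2,}", "\n", plain)
    plain = re.sub(r"[ \t]{2,}", " ", plain).strip()
    return {"embed_text": plain, "context_text": markdown}
```

At retrieval time you would embed `embed_text`, but hand `context_text` to the model, so the markup noise never influences the distance calculation yet the structure survives in the prompt.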

There are aspects that you likely do NOT want sent to the embeddings model, such as the quality of "looking like a web page". That could degrade the similarity results when your query text is not also HTML.

I would strip it bare, leaving only linefeeds for paragraphs or `<br>` tags.
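A minimal stdlib sketch of that bare-stripping approach (libraries like BeautifulSoup's `get_text()` do the same more robustly; the class and function names here are my own):

```python
from html.parser import HTMLParser


class BareText(HTMLParser):
    """Strip all markup, keeping a newline for block breaks and <br>,
    and skipping <script>/<style> contents entirely."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ("p", "br", "div"):
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.out.append(data)


def strip_html(html: str) -> str:
    parser = BareText()
    parser.feed(html)
    return "".join(parser.out).strip()
```

The result contains only the visible text with paragraph breaks preserved, which is the form least likely to bias similarity toward "looks like a web page".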

What you send to the API to receive an embedding vector can be different from the source data you store.