How to prepare the content of an HTML page for embeddings calculation

orlov · June 6, 2024, 10:41am

I would like to know the most efficient way to prepare content for calculating high-quality embeddings.

We have an HTML page whose content we would like to use for calculating embeddings.

What is the recommended way to prepare the content for calculating embeddings?

1. Keep the whole HTML code and calculate the embeddings from it?
2. Filter out the non-semantic HTML code and keep only the semantic markup (such as headings, HTML5 tags, or structured data)?
3. Filter out all of the HTML code so that only the text remains?

If case 3 is recommended, all tables and lists are filtered out. Does this not distort the calculation?

Or are embeddings calculated from something like a bag of words, where markup semantics don't play any role?

egils · June 6, 2024, 11:22am

If your HTML is clean and well structured, you can convert it into Markdown and then split it into chunks if necessary.
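As a rough illustration of the chunking step, here is a minimal sketch that assumes the page has already been converted to Markdown. The function name `split_markdown` and the `max_chars` limit are hypothetical and not from this thread. It splits before each heading and only falls back to packing paragraphs when a section is too long.

```python
import re


def split_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into chunks, one per heading section where possible.

    max_chars is an illustrative limit, not a recommended value.
    """
    # Zero-width split before every ATX heading line ("# ", "## ", ...).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Pack paragraphs together until adding one would exceed max_chars.
        current = ""
        for para in re.split(r"\n\s*\n", section):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
    return chunks
```

A single paragraph longer than `max_chars` is left as one oversized chunk in this sketch; a real pipeline would probably split it further, for example by sentence.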

My strategy is to keep two versions of the content:

plain text - with non-essential characters removed (e.g. hashes, multiple newlines, etc.), used for calculating embeddings and determining distances for retrieval;
Markdown - for inclusion in the context, to provide well-organised content.
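The two-version idea could look roughly like this. It is only a sketch: the `Chunk` record and the `markdown_to_plain` and `make_chunk` helpers are hypothetical names, and the regular expressions cover just a few common Markdown markers (headings, bullets, emphasis, backticks), not the full syntax.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    markdown: str    # kept for inclusion in the prompt context
    plain_text: str  # sent to the embeddings model and used for retrieval


def markdown_to_plain(markdown: str) -> str:
    """Remove common Markdown markers and collapse repeated newlines."""
    # Drop leading heading hashes ("# ", "## ", ...).
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", markdown, flags=re.MULTILINE)
    # Drop bullet markers at the start of a line.
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)
    # Drop emphasis asterisks and backticks.
    text = re.sub(r"[*`]+", "", text)
    # Collapse runs of newlines into a single linefeed.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


def make_chunk(markdown: str) -> Chunk:
    """Build both versions from one Markdown chunk."""
    return Chunk(markdown=markdown, plain_text=markdown_to_plain(markdown))
```

Only `plain_text` would be embedded. Once a chunk is retrieved, its `markdown` field is what goes into the context.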

_j · June 6, 2024, 1:02pm

There are aspects that you likely do NOT want sent to the embeddings model, such as the quality of “looking like a web page”. That could degrade the similarity results when your query text is not also HTML.

I would strip it bare, leaving only linefeeds for paragraphs or BR tags.

What you send to the API to receive an embeddings vector can be different from the source data you store.
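One way to do the stripping described above, using only Python's standard library, is sketched below. The names `TextExtractor` and `html_to_text`, and the choice of which tags count as block-level, are assumptions made for illustration. The result is what would be sent for embedding, while the original HTML can be stored separately.

```python
import re
from html.parser import HTMLParser

# Tags that should produce a linefeed (an illustrative, incomplete set).
BLOCK_TAGS = {"p", "br", "div", "li", "ul", "ol", "tr", "table",
              "h1", "h2", "h3", "h4", "h5", "h6", "section", "article"}
# Tags whose content is never visible text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag in ("td", "th"):
            self.parts.append(" ")  # keep table cells from running together

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip all markup, keeping one linefeed per paragraph or <br>."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t\r\f\v]+", " ", text)  # collapse runs of spaces
    text = re.sub(r" ?\n ?", "\n", text)       # trim spaces around linefeeds
    text = re.sub(r"\n{2,}", "\n", text)       # collapse repeated linefeeds
    return text.strip()
```

Table and list structure is reduced to lines and spaces here, so the cell text still reaches the embeddings model even though the markup does not.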
