What is the basis for embeddings calculation?

I have an issue with embeddings calculation I can’t explain on my own:

I use two kinds of website content extraction to calculate embeddings:

  1. I send the content of document.body.innerText to the OpenAI embeddings endpoint (https://api.openai.com/v1/embeddings) to calculate the embedding.

  2. I clean up the HTML and extract content only from the meta title and description, H1-H6, span, p, div, ol, ul, li, table, tr, and td elements, and ignore everything inside header, nav, aside, footer, and ad elements. After cleanup I format the content as Markdown according to its semantics and send it to the API to calculate the embedding.
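The cleanup step in the second method can be sketched with Python's stdlib html.parser. This is a minimal illustration, not the poster's actual code: the KEEP and SKIP tag sets mirror the lists above, and the Markdown-formatting step is omitted.

```python
from html.parser import HTMLParser

# Tags whose text is kept, and containers that are skipped entirely
# (hypothetical sets mirroring the description above).
KEEP = {"title", "h1", "h2", "h3", "h4", "h5", "h6", "span", "p",
        "div", "ol", "ul", "li", "table", "tr", "td"}
SKIP = {"header", "nav", "aside", "footer", "script", "style"}

class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped container
        self.stack = []       # open-tag stack, to know the current element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        if tag in SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when outside skipped containers and inside a kept tag.
        if text and self.skip_depth == 0 and self.stack and self.stack[-1] in KEEP:
            self.chunks.append(text)

def extract(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

page = ("<html><body><nav>Menu</nav><h1>Title</h1>"
        "<p>Body text</p><footer>Legal</footer></body></html>")
print(extract(page))  # keeps "Title" and "Body text", drops nav/footer
```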

The issue: I get very similar embeddings from both kinds of content. The embedding files differ slightly in length, but the number of values is the same.

Could somebody explain to me what is going on here? Is my second method correct and worthwhile, or is it unnecessary for higher quality?
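One note on the "same amount of values" observation: the number of values is fixed by the embedding model, not by the input (for example, text-embedding-ada-002 always returns a 1536-dimensional vector). To quantify how similar the two embeddings actually are, the usual measure is cosine similarity. Here is a minimal sketch with toy stand-in vectors in place of the real API responses:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 means identical direction (same meaning)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the two vectors returned by the embeddings endpoint
# (real vectors would have the model's full dimensionality, e.g. 1536).
raw_text_emb = [0.10, 0.30, -0.20, 0.05]
cleaned_emb  = [0.11, 0.29, -0.19, 0.06]

print(cosine_similarity(raw_text_emb, cleaned_emb))  # close to 1.0
```

A score near 1.0 would confirm numerically what the file comparison suggests: both inputs map to nearly the same point in the embedding space.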


It means the semantic meaning of both is roughly the same, so unless you are worried about the expense of the additional tokens used to communicate the tags, there is no point in cleaning up the page!

I assume that might change depending on the amount of content vs. tags?

So for a long blog post, definitely don’t bother cleaning it up?


My point is to increase the relevance of the content by removing parts that aren't important for understanding it, like anchors of social network links, email addresses, messenger handles, phone numbers, and ads. Do you really mean that removing them doesn't boost document relevance?

No, the opposite: removing them doesn’t affect it (as you have demonstrated).

Imagine there is a hidden question being asked: “what’s this about”.

And the AI answers in a million different ways all at once, encoded.


Do you know why that is?

The irrelevant content I exclude is, in my understanding, like a second tier of stop words, which are usually filtered out before embedding calculation. I'm trying to understand why stop-word filtering is considered important, while my kind of filtering apparently isn't relevant.

Because they have the same semantic meaning. That’s it.