Passing token weights to embeddings API?

Is there a way of passing token weightings (e.g. TF-IDF) to the Embeddings API so the tokens are weighted prior to the Embedding being produced?

I presume the model's internal token embeddings won't be made publicly available, as these are valuable, and I can't simply call the Embeddings API one token at a time because positional encodings are applied to the word embeddings behind the scenes?

Hi and welcome to the Developer Forum!

That's not something you can do via the API, at least not that I'm aware of. There might be some manipulation possible after the embedding is created… @curt.kennedy for visibility.

The input to the embeddings model is text.
The output is a 1536-dimension vector.

That is all.

The AI and its hidden-state embedding data are tuned so that semantically and topically similar texts map to nearby points in the embedding space. While completely proprietary and undisclosed, it is a large language model at its heart, reading and picking up meanings and themes from 2000-word documents. It doesn't really do piecemeal word2vec-style work.


I’m not totally sure what you are doing.

But you can break the input string into tokens, embed each token, and get a vector for each. Maybe use a database to avoid looking up the same token multiple times.

Then feed this “tokenized” embedding set of vectors to another model of your choice.

Basically, you could leverage this model as an internal embedding engine, then run further inference with your presumably "home-brew" models.

As for TF-IDF, you could run this on your string, and maybe mask off and pull the top tokens (or embedded words/phrases) and send this to the next step in your NLP pipeline.
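A rough sketch of that masking step — the whitespace tokenizer, plain TF-IDF scoring, and toy documents here are all simplifying assumptions:

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "the cat sat on the mat",
    "the dog ran in the park",
    "a cat and a dog ran home",
]
N = len(docs)
tokenized = [d.split() for d in docs]                      # naive tokenizer
df = Counter(w for toks in tokenized for w in set(toks))   # document frequency

def top_tokens(tokens: list[str], k: int = 3) -> list[str]:
    # Score each token by TF-IDF and keep the top k to pass downstream.
    tf = Counter(tokens)
    scores = {w: (c / len(tokens)) * math.log(N / df[w]) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Common words like "the" score near zero and get masked off automatically; the surviving top-k tokens are what you'd forward to the next pipeline stage.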

Again, kinda guessing what your motivation is here …

End-goal is to create a recommender system between two sets of documents.

I'm using ChatGPT to do information extraction on both sets of documents, which has done a far better and cheaper job of section identification/classification and entity extraction than my previous custom models. I'm deciding whether to use the doc embeddings as well, instead of open-source word2vec embeddings (the number of documents I have is not enough to train a good embeddings model). I appreciate, as _j mentioned, that these embeddings likely aren't created from a dictionary of word2vec embeddings.

The first baseline models are just cosine similarity between the embeddings of two documents. The problem is that words have different importance depending on the category of the document, and the embeddings don't take this word-level importance (derived from the set of documents within a category) into account. I've tried the simple steps of cleaning, removing stopwords, getting the tokens, and then calculating cosine distances, but the results aren't great. The next logical step would be applying some form of token weighting, but this can't be done with the API.
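For reference, the cosine-similarity baseline between two document vectors is just:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(a, b) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

(OpenAI embedding vectors come normalized to unit length, so for those the dot product alone gives the same ranking.)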

My knowledge of LLMs isn't great, but I understand that in transformers the attention mechanism lets the model weigh the importance of [sub]tokens in a sequence; the individual tokens themselves don't carry a weighting, however. I could try applying a weighting to the individual embeddings as suggested, but intuitively this feels wrong, as I'd be treating the embeddings like word2vec.

Why not try forming your own keywords database?

Here is the governing equation on a system I developed, basically a log-normalized TF-IDF, from my notes:

The information of a word W in document D is then log(1 + r) * log(N / R), where r is the frequency of W in D, R is the number of documents containing W, and N is the total number of documents.
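That formula translates directly into code (variable names follow the definitions above):

```python
import math

def information(r: int, R: int, N: int) -> float:
    # r: frequency of word W in document D
    # R: number of documents containing W
    # N: total number of documents
    return math.log(1 + r) * math.log(N / R)
```

Note that a word appearing in every document (R == N) contributes zero information — stopwords weight themselves out automatically.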

Then I take the input, break it into words, and correlate this with the documents to get the common information. This is done in memory for speed, and uses set intersections (which are also fast). Add up all the information in common, and that is your score. Rank these items from high to low.
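A minimal in-memory sketch of that scoring step — the toy documents, the per-document term-frequency dicts, and the whitespace-level "words" are all assumptions for illustration:

```python
import math

# Toy per-document term frequencies (stand-in for your real index).
doc_words = {
    "d1": {"cat": 2, "sat": 1},
    "d2": {"dog": 1, "ran": 3},
    "d3": {"cat": 1, "ran": 1},
}
N = len(doc_words)

# Inverted index: word -> set of doc ids containing it.
containing: dict[str, set[str]] = {}
for doc, counts in doc_words.items():
    for w in counts:
        containing.setdefault(w, set()).add(doc)

def score(query_words: set[str], doc: str) -> float:
    # Fast set intersection picks out the information in common,
    # then each common word contributes log(1+r) * log(N/R).
    common = query_words & set(doc_words[doc])
    return sum(
        math.log(1 + doc_words[doc][w]) * math.log(N / len(containing[w]))
        for w in common
    )

ranked = sorted(doc_words, key=lambda d: score({"cat", "ran"}, d), reverse=True)
```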

So you also run an embedding leg. Do the cosine similarity (dot product) search, also in memory for speed. Rank these correlations from high to low.

Finally, combine the two with RRF (reciprocal rank fusion). And now you have your overall ranking.

So you are combining semantic (embeddings) with keywords. With RRF you can even bias one correlation leg over the other.
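A sketch of the RRF fusion step — the constant k = 60 is the commonly used default, and the ranked lists here are made up:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # score(d) = sum over ranked lists of 1 / (k + rank of d in that list)
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse, say, an embeddings leg and a keywords leg:
fused = rrf([["a", "b", "c"], ["a", "c"]])
```

To bias one leg over the other, multiply that leg's 1/(k + rank) contributions by a per-leg weight before summing.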

I would avoid the word2vec route, because the keyword algorithm above does the information weighting for you. With vectors, you are just identifying a string with a vector, but you’d have to form your own information content of each word. You could do this with the information above. But I guess my worry there is latency.

But you could use vectors as more of a “fuzzy information correlation”. Fuzzy because the vectors would give you a proximity to other similar phrases.

The vector system has advantages, but could add more latency. So feel free to try it — though it might be more work.

The cool thing about RRF is you could run embeddings, keywords (like I have above), and the fuzzy vector-based keyword version, and fuse all three with RRF.