Is it possible to get "context aware" embeddings?

There are many questions on this but I can’t find any answer from the OpenAI team:

community .openai.com/t/contextualized-embedding-get-gpt3-embeddings-of-each-token-in-a-sentence/119715
community .openai.com/t/is-it-possible-to-build-contextualized-embedding-with-gpt/103152

(not allowed to include links in my posts so had to break them)

My goal is to embed pieces of a document (or words in a sentence), but I want the individual embeddings to be context-aware.

For example, if I embed “bank” in “river bank”, it should be a different embedding than “bank” in “financial bank”.

You COULD just embed the full sentence, but I want control over the granularity of my search. This is possible, just not currently exposed in the API, is that correct?

1 Like

I’m not sure what you mean. Can you explain what you mean with contextualization and granularity?

If you have “bank” and “river bank” and “money” in your DB, the query “I went to the bank to watch the fishes” is probably gonna be closest to “river bank”, but “they have a fish tank at the bank” would probably be closer to “bank”
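As a toy illustration of that kind of nearest-neighbour lookup (the vectors below are made up for the example, not real embeddings; a real DB would store model outputs):

```python
import numpy as np

def cosine(a, b):
    # cosine similarity: dot product of the two vectors over their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# made-up 3-d vectors standing in for real embeddings
db = {
    "bank":       np.array([0.9, 0.1, 0.2]),
    "river bank": np.array([0.5, 0.8, 0.1]),
    "money":      np.array([0.8, 0.0, 0.5]),
}

# pretend this is the embedding of "I went to the bank to watch the fishes"
query = np.array([0.4, 0.9, 0.0])

best = max(db, key=lambda k: cosine(query, db[k]))
print(best)  # "river bank" wins for these toy vectors
```

With real embeddings the same argmax-over-cosine-similarity is all a basic vector search does.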

2 Likes

Yeah, sorry, I’m very ignorant on the fundamentals here. Let me describe my understanding, and please correct me:

  • traditional word vectors are not “context aware”; the vector embedding for “bank” is fixed
  • LLMs tokenize their input, then embed it, and those embeddings are “context aware” (in the sense that the same tokens might have different embeddings depending on what document/sentence they are in)

if this is correct, I’m wondering if it’s possible to extract these embeddings.

Ok, as for what I am trying to do, what I mean by “granularity”:

Say I have a book. I want to basically do a “CTRL+F” but for semantic concepts. Say I want to find all the places in a history book where they discuss “film”. If I had an embedding of every single word individually, I could do a vector search (with some configurable distance threshold) to find closely related concepts like “movie”, “hollywood”, maybe even “navel-gazing” if I expand the distance threshold enough.
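That thresholded search could be sketched like this (the per-word vectors here are toy stand-ins; in practice they would come from an embedding model):

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 2-d stand-in vectors; real ones would be model outputs with thousands of dims
word_vecs = {
    "film":      np.array([0.90, 0.10]),
    "movie":     np.array([0.85, 0.20]),
    "hollywood": np.array([0.70, 0.40]),
    "treaty":    np.array([0.00, 1.00]),
}

def semantic_ctrl_f(query_vec, vecs, threshold):
    # keep every word whose similarity to the query clears the threshold;
    # widening the threshold pulls in more loosely related concepts
    return [w for w, v in vecs.items() if cosine(query_vec, v) >= threshold]

hits = semantic_ctrl_f(word_vecs["film"], word_vecs, threshold=0.9)
print(hits)  # ['film', 'movie', 'hollywood'] for these toy vectors
```

Lowering `threshold` is the knob described above: at 0.9 only the close concepts match, and “treaty” stays out.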

Ok, now I want to expand it to paragraphs that talk about this, even if they don’t mention any of those individual words. For that I could, instead of embedding every single word in the book, embed each sentence or each paragraph. Then that would work. I could basically maintain a hierarchy of embeddings like this (I think this is what OpenAI recommends, from my understanding of https://openai.com/index/summarizing-books/)

I’m just wondering if it would be easier/more efficient to get at these “context aware embeddings” (assuming such a thing exists). What I think that enables is that, if you have the word “film” in a paragraph where they are talking about it in a negative way, or talking about films being used in propaganda or whatever, that those semantics would be present in the embedding.

2 Likes

sorry for the wall of text. I think TL;DR this is my question ^

Basically: is it true that LLMs work this way, and if so, is it meaningful to try to work with these embeddings?

1 Like

Yes and no:

No, because OpenAI doesn’t let you extract the internal embeddings from GPT-3 or GPT-4.

Instead they have specialized “embedding” models available on their embeddings endpoint that are probably some smaller LLMs.

Yes, because Ada, for example, was a ~350M-parameter model (according to Wikipedia) and was OpenAI’s top embedding model for a while, but you have TE-3 now. SOTA embedding models are exactly this at the moment, but based on much larger models.

they likely would be present.

Now there are indeed contentious debates on how exactly text should be embedded; you can join one of them here: Using gpt-4 API to Semantically Chunk Documents

But my go-to approach is to ensure that whatever chunk you embed represents some atomic concept. For example: “A propaganda film purporting that a meal composed of emulsified vegetable oil water is a nutritious meal, and that eating it every day is your patriotic duty” is an atomic idea, and should be closer to “I can’t believe it’s not butter” and “movie” than “Homeowners’ Association”

(I’m not at my work bench atm so I can’t provide you with actual examples, this is just the gist of it)

3 Likes
import numpy as np
from openai import Client

cl = Client()

text = ["A propaganda film purporting that a meal composed of emulsified vegetable oil water is a nutritious meal, and that eating it every day is your patriotic duty",
        "I can’t believe it’s not butter",
        "movie",
        "Homeowners’ Association"]
for model in ["text-embedding-3-large"]:
    try:
        out = cl.embeddings.create(input=text, model=model)
    except Exception as e:
        print(f"ERROR {e}")
        continue  # skip this model instead of falling through with `out` undefined
    print("\n---", model)
    array = np.array([data.embedding for data in out.data])
    # the returned vectors are unit-normalized, so a plain dot product
    # against the first text gives its cosine similarity to each entry
    for t, vec in zip(text, array):
        print(f"{np.dot(array[0], vec):.5f} - {t[:30]}")

— text-embedding-3-large
1.00000 - A propaganda film purporting t
0.30461 - I can’t believe it’s not butte
0.22100 - movie
0.09620 - Homeowners’ Association

4 Likes

Hi Omar,

Not exactly sure what you mean, as embeddings already encode semantic values, not tokens or words. When you use an embedding model, the text string is converted to tokens, which are then analyzed and embedded into thousands of semantic “features” corresponding to the dimensionality of the embedding model. These dimensions lack direct interpretability, but the aggregate effect is that if you embed “river bank” it will be close in value to “water’s edge” and not near “financial bank” or “blood bank”.

The purpose of embeddings is to compare the semantic features of two different inputs, whether it is sound, or images, or text. If you embed very small passages of text, such as individual words, you lose a lot of the value that transformer-based embeddings provide, because many of the semantic features of a passage are derived from the relationships between words that are further apart.

If you are looking to build a system that can exploit the properties of different-sized embeddings, you might look at small-to-big retrieval, or node retrieval, which are techniques that use positional metadata to relate small vectors in order to also return the information around them. Not sure if that addresses your query.

1 Like

I recently got interested in this concept as a result of this post: Retrieving “Adjacent” Chunks for Better Context - Support - Weaviate Community Forum