There are many questions on this but I can’t find any answer from the OpenAI team:
(not allowed to include links in my posts so had to break them)
my goal is to embed pieces of a document (or words in a sentence), but I want the individual embeddings to be context-aware.
For example, if I embed “bank” in “river bank”, it should be a different embedding than “bank” in “financial bank”.
you COULD just embed the full sentence, but I do want control over the granularity of my search. This is possible, just not currently exposed in the API, is that correct?
I’m not sure what you mean. Can you explain what you mean with contextualization and granularity?
If you have “bank” and “river bank” and “money” in your DB, the query “I went to the bank to watch the fishes” is probably gonna be closest to “river bank”, but “they have a fish tank at the bank” would probably be closer to “bank”
yeah sorry I’m very ignorant on the fundamentals here. let me describe my understanding & please correct me:
traditional word vectors are not “context aware”, the vector embedding for “bank” is fixed
LLMs tokenize their input, then embed it, and those embeddings are “context aware” (in the sense that the same tokens might have different embeddings depending on what document/sentence they are in)
if this is correct, I’m wondering if it’s possible to extract these embeddings.
Ok, as for what I am trying to do, what I mean by “granularity”:
say I have a book. I want to basically do a “CTRL+F” but for semantic concepts. Like say I want to find all the places in a history book where they discuss “film”. If I had an embedding of every single word individually, I could do a vector search (with some configurable distance threshold) to find closely related concepts like “movie”, “hollywood”, maybe even “navel-gazing” if I expand the distance threshold enough.
Ok, now I want to expand it to paragraphs that talk about this, even if they don’t mention any of those individual words. For that I could, instead of embedding every single word in the book, embed each sentence or each paragraph. Then that would work. I could basically maintain a hierarchy of embeddings like this (I think this is what OpenAI recommends, from my understanding of https://openai.com/index/summarizing-books/)
I’m just wondering if it would be easier/more efficient to get at these “context aware embeddings” (assuming such a thing exists). What I think that enables is that, if you have the word “film” in a paragraph where they are talking about it in a negative way, or talking about films being used in propaganda or whatever, that those semantics would be present in the embedding.
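To make the word-level version of this concrete, here’s a rough sketch of the “semantic CTRL+F” I’m imagining, using the public embeddings endpoint rather than any internal GPT activations. The corpus, query, and threshold value are made up for illustration; since the returned vectors are unit-normalized, a dot product gives the cosine similarity:

import numpy as np
from openai import OpenAI

client = OpenAI()
MODEL = "text-embedding-3-small"  # any embedding model would do here

def embed(texts):
    out = client.embeddings.create(input=texts, model=MODEL)
    return np.array([d.embedding for d in out.data])

# pretend these are the individual words (or paragraphs) of the book
corpus = ["movie", "hollywood", "battleship", "harvest", "propaganda"]
corpus_vecs = embed(corpus)

query_vec = embed(["film"])[0]
threshold = 0.4  # the configurable distance threshold; tune per corpus

# dot product of unit vectors = cosine similarity
scores = corpus_vecs @ query_vec
for word, score in sorted(zip(corpus, scores), key=lambda t: -t[1]):
    if score >= threshold:
        print(f"{score:.3f}  {word}")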
No because OpenAI won’t allow you to extract embeddings from GPT-3 or GPT-4
Instead they have specialized “embedding” models available on their embeddings endpoint that are probably some smaller LLMs.
Yes, because Ada for example was a ~350M parameter model (according to Wikipedia) and was OpenAI’s top embedding model for a while, but you have TE-3 now. SOTA embedding models are exactly this at the moment, but based on much larger models.
but my go-to approach is to ensure that whatever chunk you embed represents some atomic concept. For example: “A propaganda film purporting that a meal composed of emulsified vegetable oil water is a nutritious meal, and that eating it every day is your patriotic duty” is an atomic idea, and should be closer to “I can’t believe it’s not butter” and “movie” than to “Homeowners’ Association”
(I’m not at my work bench atm so I can’t provide you with actual examples, this is just the gist of it)
import numpy as np
from openai import Client as o; cl = o()

text = ["A propaganda film purporting that a meal composed of emulsified vegetable oil water is a nutritious meal, and that eating it every day is your patriotic duty",
        "I can’t believe it’s not butter",
        "movie",
        "Homeowners’ Association"]

for model in ["text-embedding-3-large"]:
    try:
        out = cl.embeddings.create(input=text, model=model)
        print("\n---", model)
    except Exception as e:
        print(f"ERROR {e}")
        continue  # skip the comparison if the request failed
    # embeddings are unit length, so a dot product gives the cosine similarity
    array = np.array([data.embedding for data in out.data])
    for compi, comp in enumerate(text[:1]):
        for i, j in zip(text, array):
            print(f"{np.dot(array[compi], j):.5f} - {i[:30]}")
--- text-embedding-3-large
1.00000 - A propaganda film purporting t
0.30461 - I can’t believe it’s not butte
0.22100 - movie
0.09620 - Homeowners’ Association
Not exactly sure what you mean, as embeddings already encode semantic values, not tokens or words. When you use an embedding model, the text string is converted to tokens, which are then analyzed and embedded into thousands of semantic “features” corresponding to the dimensionality of the embedding model. These individual dimensions lack direct interpretability.
But the aggregate effect is that if you embed “river bank” it will be close in value to “water’s edge” and not near “financial bank” or “blood bank”.
The purpose of embeddings is to compare the semantic features of two different inputs, whether it is sound, or images, or text. If you embed very small passages of text, such as individual words, you lose a lot of the value that transformer-based embeddings provide, because many of the semantic features of a passage are derived from the relationships between words that are further apart.
If you are looking to build a system that can exploit the properties of different-sized embeddings, you might look at small-to-big retrieval, or node retrieval: techniques that use positional metadata to relate small vectors to the larger context around them, so that the surrounding information can also be returned. Not sure if that addresses your query.
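To illustrate the small-to-big idea, here is a rough sketch under made-up names (not any particular library’s API): the small units (sentences) are what get embedded and searched, but each one keeps a pointer to the bigger unit (paragraph) it came from, and the parent is what gets returned:

import numpy as np
from openai import OpenAI

client = OpenAI()

paragraphs = ["First paragraph of the book ...", "Second paragraph of the book ..."]
# small units, each tagged with the index of its parent paragraph
sentences = [("A sentence from paragraph one.", 0),
             ("Another sentence, also from paragraph one.", 0),
             ("A sentence from paragraph two.", 1)]

resp = client.embeddings.create(input=[s for s, _ in sentences],
                                model="text-embedding-3-small")
sent_vecs = np.array([d.embedding for d in resp.data])

def retrieve_parent(query):
    q = client.embeddings.create(input=[query], model="text-embedding-3-small").data[0].embedding
    best = int(np.argmax(sent_vecs @ np.array(q)))   # match on the small vector
    return paragraphs[sentences[best][1]]            # return the big parent chunk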
You can embed at different granularities, like word/sentence/paragraph/page, etc. Search across all the granularities, then fuse the results to form an overall result (RRF/RSF). Also, you can weight each granularity differently … so paragraphs more than sentences more than words, or whatever you decide.
If each granular chunk had an index indicating the location within the corpus, you could also grab adjacent chunks to provide more surrounding context.
So search at all your granularities, fuse the results, grab the highest fused results, expand each result by some radius, and then these are your chunks for RAG.
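For what it’s worth, the fusion step itself is only a few lines. A rough sketch of reciprocal rank fusion with per-granularity weights is below; it assumes the hits from each granularity have already been mapped to a common location key (say, a paragraph index) so they are comparable, and the k value and weights are just placeholders:

def rrf_fuse(ranked_lists, weights=None, k=60):
    """ranked_lists: {granularity: [location_id, ...] ordered best-first}"""
    weights = weights or {name: 1.0 for name in ranked_lists}
    scores = {}
    for name, ranking in ranked_lists.items():
        for rank, loc in enumerate(ranking, start=1):
            scores[loc] = scores.get(loc, 0.0) + weights[name] / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(
    {
        "word":      ["p7", "p2", "p19"],   # word hits, mapped to their paragraph
        "sentence":  ["p2", "p7"],
        "paragraph": ["p2", "p41"],
    },
    weights={"word": 0.5, "sentence": 1.0, "paragraph": 2.0},  # paragraphs weighted highest
)
print(fused[:3])  # top fused locations; expand each by some radius before sending to RAG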
Each embedded chunk has a metadata property which uniquely identifies it and its position in the chunked document.
When the chunk is retrieved by cosine similarity search, I programmatically use the chunk identifier to locate adjacent chunks.
I send the original key chunk, its adjacent chunks, the question, and the chat history to the model to render a response.
So far, this is working very well. My chunks can be as small as one sentence or as large as multiple paragraphs, and I can adjust how many adjacent chunks are returned depending on the type of documents processed. This would be the adjacent chunk “radius” as defined by @curt.kennedy
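In case it helps, the expansion step can be sketched in a few lines; the store and field names here are hypothetical, but any vector DB that keeps a document id plus a sequential position in the chunk metadata works the same way:

def expand_hit(hit, chunks_by_doc, radius=1):
    """hit: metadata dict like {"doc_id": ..., "position": int}"""
    doc_chunks = chunks_by_doc[hit["doc_id"]]           # chunks in document order
    lo = max(0, hit["position"] - radius)
    hi = min(len(doc_chunks), hit["position"] + radius + 1)
    return doc_chunks[lo:hi]                            # key chunk plus its neighbors

# usage: after the cosine-similarity search returns its top hit
chunks_by_doc = {"history_book": ["chunk 0 ...", "chunk 1 ...", "chunk 2 ...", "chunk 3 ..."]}
top_hit = {"doc_id": "history_book", "position": 2}
context = expand_hit(top_hit, chunks_by_doc, radius=1)
# "\n".join(context) goes to the model along with the question and chat history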