Is it possible to get "context aware" embeddings?

Yes and no:

No, because OpenAI won’t let you extract embeddings from GPT-3 or GPT-4.

Instead, they have specialized “embedding” models available on their embeddings endpoint, which are probably smaller LLMs.

Yes, because Ada, for example, was a ~350M-parameter model (according to Wikipedia) and was OpenAI’s top embedding model for a while; now you have text-embedding-3. SOTA embedding models at the moment are exactly this, just based on much larger models.

Since these embedding models are LLMs under the hood, context-aware representations likely would be present.

Now, there are indeed contentious debates on how exactly text should be embedded. You can join one of them here: Using gpt-4 API to Semantically Chunk Documents

But my go-to approach is to ensure that whatever chunk you embed represents some atomic concept. For example: “A propaganda film purporting that a meal composed of emulsified vegetable oil and water is a nutritious meal, and that eating it every day is your patriotic duty” is an atomic idea, and should be closer to “I can’t believe it’s not butter” and “movie” than to “Homeowners’ Association”.

(I’m not at my workbench at the moment, so I can’t provide you with actual examples; this is just the gist of it.)
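To sketch the gist anyway: here is a minimal, self-contained illustration of comparing chunk embeddings by cosine similarity. The vectors below are made-up toy values, not real embeddings; in practice you would fetch each vector from an embeddings endpoint (e.g. OpenAI's `client.embeddings.create(model="text-embedding-3-small", input=chunk)`) and the dimensionality would be in the hundreds or thousands.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real embeddings.
propaganda_chunk = [0.8, 0.1, 0.5, 0.2]  # the "propaganda film" chunk
butter = [0.7, 0.2, 0.6, 0.1]            # "I can't believe it's not butter"
hoa = [0.1, 0.9, 0.0, 0.8]               # "Homeowners' Association"

# The atomic chunk should land closer to the related concept
# than to the unrelated one.
print(cosine_similarity(propaganda_chunk, butter))  # higher
print(cosine_similarity(propaganda_chunk, hoa))     # lower
```

With a well-chunked corpus, this ordering is exactly what lets retrieval surface the right passages.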
