Does it make sense to add context to text before embedding?

I am analysing a large dataset of concepts applied to a specific domain.
These concepts may mean very different things in general, but something very similar when applied to a specific domain.
For example: “Python” and “Go” as general terms, versus the same terms within the computer science domain.

If I wanted to cluster a large dataset of computer science concepts by retrieving their embeddings and running k-means or a similar algorithm, does it make sense to pre-process the concepts and add a computer science hint before calling the embedding API?
For the example above: instead of retrieving the embeddings of “Python” and “Go”, would it make sense to retrieve the embeddings of “‘Python’ (Computer Science)” and “‘Go’ (Computer Science)”?
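Concretely, the kind of pre-processing I have in mind would look something like this (just a sketch; the model name, suffix format and cluster count are only examples):

```python
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

concepts = ["Python", "Go", "Java", "Rust"]

# Append a domain hint to each concept before embedding it.
texts = [f"'{c}' (Computer Science)" for c in concepts]

resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = [d.embedding for d in resp.data]

# Cluster the embeddings; k is a placeholder here.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
```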

Assuming the answer is yes, are there any relevant examples or papers on this topic?

Depends,

You’d want to disambiguate your stuff as much as possible, obviously. But you don’t want to go overboard and exhaust ADA’s attention window, for example. The 8191 token input limit is a joke.

If it’s already clear from the context that Python refers to the language, then it’s probably not necessary.

However

The embedding models OpenAI currently provides are very sensitive to the shape and format of your text. Including highly artificial fluff in your corpus might help erase this superficial pattern matching on shorter texts, so it may not be a bad idea if you can do it uniformly :thinking:

You likely won’t get around doing some experimentation yourself. A heatmap has always been helpful for me in getting a feel for how the model behaves.
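Something like this has served me well (a sketch; it assumes you already have the embedding vectors and a label per text in hand):

```python
import numpy as np
import matplotlib.pyplot as plt

def cosine_heatmap(vectors, labels):
    # Normalize the rows so pairwise dot products become cosine similarities.
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sim = v @ v.T

    fig, ax = plt.subplots()
    im = ax.imshow(sim, vmin=sim.min(), vmax=1.0)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    fig.colorbar(im)
    plt.show()
```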


The problem is that the field is evolving so rapidly and there’s so much garbage coming out in terms of papers - even evals can’t keep up with the models and capabilities that keep emerging - so it’s incredibly difficult.


In conclusion I think it’s a good idea worth exploring - if I have more time and don’t forget I’ll get back to it. If you have the opportunity to test it and share your findings it would be tremendously appreciated!

Yes. But I agree with @Diet that you don’t want to spend too much time going overboard with it.

In my mind, I see three concepts:

  1. SNR (Signal-to-Noise Ratio). You are trying to increase the SNR for these fairly common words by giving them more context. Good. But keep in mind that doing this too much simply increases the amplitude of noise.

  2. AlphaHybrid. This is a search that sort of mixes semantic search with keyword search. In your case, you would be looking more at semantic search (meaning) rather than keywords (as Go will be found in a lot of other contexts).

  3. Metadata and Context. Not knowing a thing about your data, when I saw your example of “Go” (Computer Science), the idea of “Categorization” popped into my head. Not sure how or if this even applies, but the concept of using “metadata” in addition to your embeddings to give them more context is also a good approach. I think everyone will agree that gpt-4 and above, at least, will understand “Python” within the context of the sentences in which it is used in your embeddings. But further enriching those embeddings with titles, descriptions, categories and even summarizations would only increase the AI’s understanding of the contexts where the word(s) are found.
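For point 3, a rough sketch of what that could look like (the field names here are invented; use whatever metadata your records actually carry):

```python
def embedding_text(record):
    # Prepend category / title metadata so the bare term sits in a richer context.
    return (
        f"Category: {record['category']}\n"
        f"Title: {record['title']}\n"
        f"Summary: {record['summary']}\n"
        f"Body: {record['body']}"
    )

record = {
    "category": "Computer Science",
    "title": "Go",
    "summary": "A statically typed language from Google.",
    "body": "Go is often used for building networked services...",
}
text_to_embed = embedding_text(record)
```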

Anyway, bottom line is you’ll need to try out a few methods to see which give you the better responses.

thanks @Diet and @SomebodySysop!

I ran a little toy example:
I generated 49 statements that take the shape f"{person} is from {city}" (7 people * 7 cities)

Then I retrieved embeddings with additional context. I tried many different variations; here is one example:
# df['statement'] holds the 49 generated statements
embedding_contexts = {
    'no_context': df['statement'],
    'location': 'The **CITY** indicated in the statement: """' + df['statement'] + '"""',
    'person': 'The **FIRST NAME** indicated in the statement: """' + df['statement'] + '"""',
}
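The retrieval step for each variant was roughly this (a sketch; the exact model name isn’t the point here):

```python
from openai import OpenAI

client = OpenAI()

embeddings = {}
for name, texts in embedding_contexts.items():
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative; any embedding model works the same way
        input=texts.tolist(),
    )
    embeddings[name] = [d.embedding for d in resp.data]
```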

The similarity heat map is a nice trick (thanks @Diet!):


Adding context definitely changes the outcome, but it seems I always added more noise than signal, and the outcome changed in unpredictable ways.
For example, at times, adding the “Person” context seemed to amplify the “City” signal more than adding the “Location” context did.


Ah, with context you actually meant a sort of promptability/instructability? :grimacing:

we investigated that a little bit in this thread as well: New OpenAI Announcement! Updated API Models and no more lazy outputs - #9 by Diet (in that case the instruction was rear-loaded instead of front-loaded)

However, if we think it through a little: what does a Bob in London have in common with a person called Bob that happens to live in Reno, vs. a Mary that lives in Reno? This could just be the result of unfortunate prompting…

That said, it’s interesting how dramatically the model discriminates by gender, and how it considers DANA and RENO different from the others.

If we wanted to improve the signal as you proposed in your original post, would it make sense to rerun the experiment with something like this?

"Dana (Person) is from Reno (City)"
<=>
"We're focusing on the person's name in the statement: Dana (Person) is from Reno (City)"
<=>
"We're focusing on the city in the statement: Dana (Person) is from Reno (City)"

:thinking:
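If you do rerun it, one simple way to score which signal each framing amplifies (just a sketch; it assumes you keep the person and city labels alongside each statement):

```python
import numpy as np
from itertools import combinations

def mean_pair_similarity(vectors, groups):
    # Average cosine similarity between statements that share the same group label
    # (same person, or same city).
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = [v[i] @ v[j] for i, j in combinations(range(len(v)), 2) if groups[i] == groups[j]]
    return float(np.mean(sims))

# e.g. person_score = mean_pair_similarity(vectors, list(df["person"]))
#      city_score   = mean_pair_similarity(vectors, list(df["city"]))
# If the person-focused instruction works, person_score should rise relative to city_score.
```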

I’d also like to try this with mistral embeds at some point, but MSFT is really stingy with their GPU VMs atm :frowning:

“Sam is from Los Angeles. We’re focusing on the person’s name in the previous statement.” (rear-loaded) seems to be the best one yet.

To put this in the context of a real-world use case,
consider an advertising platform that analyzes its internal advertisements from different perspectives:

  • creative strategies employed (there are multiple different angles on just that one),
  • infer what product/category is being advertised,
  • infer which audience the ad appeals to,
  • does this ad offend anyone, and does it abide by our content policy?

It would be impractical to call the full model for every creative, so I am looking for ways to do this through the embeddings.
I was hoping this “hack” would make it possible for me to skip a step (or a few).
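One direction I’m considering (a sketch; the categories, ad texts and model name are all made up for illustration) is to embed a set of label descriptions once and score each creative against them:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts, model="text-embedding-3-small"):
    resp = client.embeddings.create(model=model, input=texts)
    v = np.array([d.embedding for d in resp.data], dtype=float)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Hypothetical category labels; embed each description once and reuse.
categories = ["consumer electronics", "fashion", "travel", "food delivery"]
cat_vecs = embed([f"An advertisement for {c}" for c in categories])

ads = ["Fly to Lisbon for less this spring.", "Crispy fries at your door in 20 minutes."]
ad_vecs = embed(ads)

scores = ad_vecs @ cat_vecs.T  # cosine similarities (rows are already normalized)
best = scores.argmax(axis=1)
for ad, idx in zip(ads, best):
    print(ad, "->", categories[idx])
```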


Would it help to use filters in the search? For example, if you add a “Location” property to your embeddings, now you can filter by “Los Angeles” (if that is your goal). Similar to searching for discussions involving “Python” as it relates to “Computer Science”.

And what about adding some good old reliable keyword filtering to your search? Sometimes when I’m having a very difficult time finding that needle in the haystack, being able to add a keyword I know will be in the embedding cuts down on noise significantly.
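Crudely, something like this (the field names are invented just to illustrate the idea):

```python
chunks = [
    {"text": "Go routines make concurrency simple.", "location": "Los Angeles"},
    {"text": "Python decorators explained.", "location": "Reno"},
]

# 1) metadata filter
in_la = [c for c in chunks if c["location"] == "Los Angeles"]
# 2) keyword filter
with_go = [c for c in in_la if "go" in c["text"].lower()]
# 3) only the survivors go on to the cosine-similarity ranking step
```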

Again, I can’t visualize exactly what you’re trying to do, but I don’t think you are going to be able to embed your way out of it alone. Nor is there going to be a single prompt strategy that will work in every instance.

No one in the Developer Forum agrees with me on this, but I think the best strategy is to get the data as refined as you can, then give your end users the tools they can use to find the needles they are looking for – and train them on how to use them effectively. Sounds to me like your user base will be sophisticated enough to handle that.


do you have a beautiful heatmap for us? :pray:

you could probably publish a paper on that (instructed embeddings) :thinking:


Here is the heatmap from that run.
“Person” is still stronger than “Location”, but in this example I believe the Location signal is stronger with the location context (first chart).

Publishing is an interesting idea, but I’m not sure this merits a paper on its own. My guess is that instructed embeddings would be difficult to utilize on their own - but they may be a useful step for enhancing features that an additional model could leverage (compared to a model that operates on uninstructed embeddings).
So maybe demonstrating the entire flow from instructed embeddings to a useful model (on a more realistic dataset) would be interesting to publish?


@SomebodySysop Finding a needle in a haystack is not what I’m trying to do; I am trying to find structure in the haystack (from different points of view).

Consider the advertising platform example: I would like to know, over the entire corpus, which strategies are being used, without pre-defining a strategy taxonomy. Strategies change over time, so I would like to leverage the general nature of LLMs for topic modeling (hopefully that made sense).
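Roughly, the pipeline I’m imagining (a sketch with placeholder data; in practice k would be chosen empirically rather than fixed up front):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# One embedding per creative (placeholder random data so the sketch runs end to end).
ad_vectors = np.random.rand(500, 1536)

k = 20  # would be picked via silhouette score or similar, not hard-coded
km = KMeans(n_clusters=k, n_init=10).fit(ad_vectors)

# Index of the creative closest to each centroid: a cheap "representative" per cluster.
# Only these k representatives would then go to a chat model to name the strategy.
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, ad_vectors)
```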

You got me on that one. How you get the AI to intuit what a strategy is without first defining the parameters of a strategy is beyond me. Then there’s being able to classify various strategies without a classification definition. Very interested to see how you figure this one out.