Adding values to ADA-002 embeddings?

IC_RD · September 19, 2023, 8:00pm

I want to adjust the embedding vector values as I get user feedback. For example, if the user labels 2 documents as relevant to the query, then I want to adjust the documents’ vectors to be closer to the query.
Would that potentially work? I am not sure how ADA generates the vectors, so it is tough for me to determine whether such a method of incorporating user feedback into the document retrieval process would be helpful.

Foxalabs · September 19, 2023, 9:21pm

This sounds non-trivial. Unless I’m missing something very basic, this is a very complex mathematical problem. You can perform operations on vectors, but fine-tuning them based on human feedback is going to get real interesting real quick.

@curt.kennedy may have some ideas.

IC_RD · September 19, 2023, 9:27pm

I was thinking of finding the difference between the query vector and the document vector. If the user had labeled the document vector as relevant then 0.8 of the vector difference would be added to the document permanently. That way this would boost its ranking in cosine similarity very significantly for that query. However, this would definitely affect the document’s cosine similarity toward other queries in unknown ways.
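
As a rough sketch of that idea, the snippet below moves a document vector 0.8 of the way toward a query and compares cosine similarities before and after. Everything here is an assumption for illustration: the toy 3-dimensional vectors stand in for real ADA-002 embeddings, and `nudge_toward` is a hypothetical helper, not part of any API. The re-normalization step assumes you want to keep vectors at unit length.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nudge_toward(doc, query, weight=0.8):
    """Add `weight` of (query - doc) to doc, then rescale to unit length."""
    moved = [d + weight * (q - d) for d, q in zip(doc, query)]
    norm = math.sqrt(sum(x * x for x in moved))
    return [x / norm for x in moved]


# Toy vectors (assumed values, not real embeddings).
query = [1.0, 0.0, 0.0]        # the query the user gave feedback on
other_query = [0.0, 0.0, 1.0]  # an unrelated query that used to match well
doc = [0.6, 0.0, 0.8]

moved = nudge_toward(doc, query)

print("feedback query:", cosine(doc, query), "->", cosine(moved, query))
print("other query:   ", cosine(doc, other_query), "->", cosine(moved, other_query))
```

In this toy case the similarity to the feedback query goes up while the similarity to the other query drops, which is the side effect on other queries mentioned above.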

Foxalabs · September 19, 2023, 9:39pm

Are you thinking of adding some kind of a meta header with a cosine offset for each embed? Then … then store that embedding back with the modified offset … then if that particular chunk is retrieved in the future you de-rank/up-rank it based on the metaheader offset?

That would actually work.

Foxalabs · September 19, 2023, 9:45pm

So… every embedding starts off with a 0 offset, and then if a human thinks this particular retrieval chunk is great or bad they can shove a little offset on: a “rank” value that is then used to boost or reduce a chunk’s likely inclusion in any given Top_k list of chunks…

Still going to get unpopular chunks being retrieved, but you would have a form of RLHF.
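
A minimal sketch of that scalar “rank” offset, assuming the retriever already returns `(chunk_id, cosine_score)` pairs. The names `vote` and `rerank`, the step size of 0.02, and the sample scores are all made up for illustration.

```python
def vote(rank_offsets, chunk_id, up, step=0.02):
    """Shift a chunk's stored rank offset up or down by one feedback step."""
    delta = step if up else -step
    rank_offsets[chunk_id] = rank_offsets.get(chunk_id, 0.0) + delta


def rerank(hits, rank_offsets, top_k=3):
    """Add each chunk's offset (default 0) to its score and keep the top_k."""
    adjusted = [(cid, score + rank_offsets.get(cid, 0.0)) for cid, score in hits]
    adjusted.sort(key=lambda pair: pair[1], reverse=True)
    return adjusted[:top_k]


# Hypothetical retrieval result: chunk ids with raw cosine scores.
hits = [("a", 0.82), ("b", 0.80), ("c", 0.79)]

rank_offsets = {}                    # every chunk starts with a 0 offset
vote(rank_offsets, "c", up=True)     # two users liked chunk "c"
vote(rank_offsets, "c", up=True)

top_ids = [cid for cid, _ in rerank(hits, rank_offsets, top_k=2)]
print(top_ids)
```

The offsets live outside the embedding itself, so the original vectors stay untouched.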

curt.kennedy · September 19, 2023, 9:50pm

Instead of an offset number for the user/query/document, I would add an offset vector.

So User: Query → OffsetVector → correlate with QueryVector + OffsetVector.

Where OffsetVector = DocumentVector - QueryVector.

This way you are getting the exact document vector back for the same query, and theoretically a similar document back for a similar query (if you use this same OffsetVector in a small neighborhood of the QueryVector). But you can have it match identically too (no neighborhood, just the point vector exactly at the query).
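
The arithmetic can be sketched as follows, using toy 3-dimensional vectors in place of real embeddings. The helper names and the sample values are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


query_vec = [0.6, 0.8, 0.0]   # toy query embedding
doc_vec = [0.0, 0.6, 0.8]     # toy embedding of the document the user chose

# OffsetVector = DocumentVector - QueryVector
offset = subtract(doc_vec, query_vec)

# Same query: QueryVector + OffsetVector lands on the document vector
# (up to floating-point rounding).
redirected = add(query_vec, offset)

# Similar query in the neighborhood: reuse the same offset.
similar_query = [0.62, 0.78, 0.05]
redirected_similar = add(similar_query, offset)

print(cosine(similar_query, doc_vec), "->", cosine(redirected_similar, doc_vec))
```

For the similar query, the redirected vector ends up much closer to the chosen document than the raw query was.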

Foxalabs · September 19, 2023, 10:00pm

So how would many people saying this chunk is great make that chunk more likely to be used in the future? I’m not sure I follow how that would be done. I guess the same goes for the opposite case: how would you reduce the likelihood of a poor chunk (as voted by humans)?

curt.kennedy · September 19, 2023, 10:08pm

You could make this offset private to the User, or “democratize” the offset, and broadcast it to all or some Users, if a certain high percentage of trusted Users decides this is a good choice. It would depend on the situation.

Foxalabs · September 19, 2023, 10:19pm

I guess you could also store this offset in a separate database, so long as you have a unique index value in each chunk’s meta header… that works.

Just thinking out loud for ways to give users personalisation and some degree of management… I guess each chunk has a unique index key anyway… so… spit balling

curt.kennedy · September 19, 2023, 10:22pm

Yes, the offset vectors are in a different database. First a query comes in. If we are talking about a neighborhood, you correlate the query against this database and find the most correlated vector within the neighborhood tolerance; if nothing falls within the tolerance, your offset vector is the all-zeros vector.

If it’s point wise, you can also just use a hash, without any vector math search involvement.
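
Both lookups can be sketched roughly like this. The in-memory list and dict stand in for the separate offset database, the tolerance of 0.9 is an arbitrary assumption, and hashing the normalized query text is just one possible choice of key for the point-wise case.

```python
import hashlib
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def find_offset(query_vec, offset_store, tolerance=0.9):
    """Neighborhood lookup: return the offset of the most correlated stored
    query whose similarity is at least `tolerance`, else the all-zeros vector."""
    best_offset, best_score = None, tolerance
    for stored_query, offset in offset_store:
        score = cosine(query_vec, stored_query)
        if score >= best_score:
            best_offset, best_score = offset, score
    if best_offset is None:
        return [0.0] * len(query_vec)
    return best_offset


def point_key(query_text):
    """Point-wise lookup key: hash of the whitespace- and case-normalized text."""
    normalized = " ".join(query_text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Toy offset database: (stored query vector, offset vector) pairs.
offset_store = [([1.0, 0.0], [-1.0, 1.0])]

print(find_offset([0.99, 0.05], offset_store))  # inside the neighborhood
print(find_offset([0.0, 1.0], offset_store))    # outside: zero offset

# Point-wise variant: exact-match table, no vector search involved.
point_table = {point_key("gps watch"): [-1.0, 1.0]}
print(point_table.get(point_key("GPS  Watch")))
```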

_j · September 19, 2023, 11:10pm

That’s the main thing here. The qualities of underlying semantics.

If I say to myself “here’s the best match for my semantic search, I’m going to average in a little bit of my input vector to the database chunk vector, to bring it closer to that type of search”, I have colored the result in more ways than I perceive.

Your database chunk now has a bit of the appearance and qualities of the input to it.

Consider:

Popularly searched chunks get weighted more than those that linger, waiting for someone who needs them later.
When someone types in a “short” “informal” question “without capitalization”, you now have some chunks that are activated more by the format of the question than by their knowledge quality.
(And a bazillion other semantics, like whether the question is closer to George Foreman or Instant Pot.)

anon10827405 · September 19, 2023, 11:23pm

Yeah I feel like this could be considered purposefully glitching ADA’s embedding universe and have a lot of strange side effects.

The only way I can see this functioning correctly is by having the typical embeddings return the similar documents, and then re-ranking them with a separate recommendation engine based on the semantics of the query and other users’ actions (and then an investigation into why this is happening).

Something as simple as a centroid calculation may be suitable. You are trying to “pull” the query and the aggregated preferred result together, so a central point would naturally do this (and it can also be done in the same database if using Weaviate!).

curt.kennedy · September 20, 2023, 3:35am

The OP is asking for a translation from the question to the correct answer. A common problem is that the question doesn’t correlate with the answer.

A common solution is to create a series of questions that are then associated with the answer. So InputQuestion —> HiddenQuestion —> DocumentAnswer

The problem is that this is very manual, and doesn’t scale.

If a user is in the loop, then use simple vector math to translate the question embedding to the document answer.

This scales, and adds only a marginal amount of latency for the initial offset search (a much smaller space than the entire set of document answers).

It can also be cloned to the entire group of users, or a subset of users.

Scalable + Epsilon latency >> Not scalable.

anon10827405 · September 20, 2023, 4:18am

Agreed. So many ways to ask the same thing. Sometimes it’s just flat out wrong. That’s why recommendation engines are awesome! They support subjectivity in the query!

I feel like the result of the centroid would end up being somewhat similar to the result of the offset. It is a very simple calculation: average the query vectors of the users who have paired the document they wanted with the query they used. One database though.

I tried to find some information on how it works but had to go straight to the source, which is in Go. Fortunately it wasn’t too rough on me to find. (It’s just a simple mean calculation of the vectors.)

github.com/weaviate/weaviate

modules/ref2vec-centroid/vectorizer/method_mean.go

          //
          //  Copyright © 2016 - 2025 Weaviate B.V. All rights reserved.
          //
          //  CONTACT: hello@weaviate.io
          //
          
          package vectorizer
          
          import "fmt"
          
          func calculateMean(refVecs ...[]float32) ([]float32, error) {
          	if len(refVecs) == 0 || len(refVecs[0]) == 0 {
          		return nil, nil
          	}
          
          	targetVecLen := len(refVecs[0])
          	meanVec := make([]float32, targetVecLen)
          
          	// TODO: is there a more efficient way of doing this?
          	for _, vec := range refVecs {
          		if len(vec) != targetVecLen {

Definitely more computational costs but I think the results are worth it!
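
For readers who do not use Go, a mean calculation of this kind can be sketched in Python as below. This is an illustrative analogue, not a transcription of the Weaviate source; the function name `centroid` and the toy query vectors are assumptions.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors.

    Returns None for an empty list and raises ValueError when the
    vectors do not all have the same length.
    """
    if not vectors:
        return None
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


# Toy embeddings of queries that different users paired with one document.
paired_queries = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]

doc_centroid = centroid(paired_queries)
print(doc_centroid)
```

The centroid can be recomputed whenever a new query is paired with the document, without touching the document's own embedding.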

curt.kennedy · September 20, 2023, 1:47pm

I think this can be done with my vector offset system.

For example. The user types “GPS Watch” in the search bar.

The phrase “GPS Watch” is embedded and first correlated with the layer of offsets … I’ll explain further:

So, “GPS Watch” now correlates to these top three things in the initial offset layer:

“Garmin Epix 2”
“Apple Watch Series 5”
“Samsung Galaxy Watch 5”

These three things don’t contain any descriptive content, and this is what you are hoping to retrieve.

OK, so how do you do this? For each of these three hits, since you are in the initial offset layer, you also get an offset vector. You add this offset vector to a search embedding to get a new vector, one for each of the three items. For a solid redirect, add it to the embedding stored in the offset layer (so “Garmin Epix 2”, not “GPS Watch”) rather than to the original search. The vector offset also allows you to abstract this to the original “GPS Watch” embedding, so try that if you have a dense set of products established.

Then you perform three searches, one for each vector, but now you are searching your product database embeddings. So let’s say for each search, you retrieve the top three items as well. So now you have a total of 9 items.

You can now present these 9 items to the user (or to the LLM), even rank them in some manner.

This is the essence of “vector redirected search”, where the redirect points in the direction of the product ultimately chosen or related to the search query. Because of the continuity of embeddings, you also get other related things that were never mapped before. For example, if you drop in a new item that has no vector pointing to it, it is still exposed in the search.

If the user buys this new product, you could then go back and have a vector pointing to this new product from the original search. So over time, your search results will improve.
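
The two-stage flow described in this post might be sketched as follows. The function names, the 2-dimensional toy vectors, and the in-memory lists are all assumptions for illustration; a real system would use a vector database for both layers. The sketch adds each offset to the embedding stored in the offset layer, and de-duplicates results, so it can return fewer than `k_offsets * k_products` items.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top_k(query_vec, items, k):
    """items: list of (name, vector). Return the k most similar items."""
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return ranked[:k]


def redirected_search(search_vec, offset_layer, products, k_offsets=3, k_products=3):
    """offset_layer: list of (label, stored_vector, offset_vector)."""
    # Stage 1: find the closest entries in the offset layer.
    hits = sorted(offset_layer, key=lambda e: cosine(search_vec, e[1]), reverse=True)
    results = []
    for _label, stored_vec, offset_vec in hits[:k_offsets]:
        # Stage 2: redirect, then search the product embeddings.
        redirected = [s + o for s, o in zip(stored_vec, offset_vec)]
        for name, _vec in top_k(redirected, products, k_products):
            if name not in results:
                results.append(name)
    return results


# Toy data (assumed values).
products = [("product-a", [0.0, 1.0]), ("product-b", [1.0, 0.0]), ("product-c", [-1.0, 0.0])]
offset_layer = [("prior query", [1.0, 0.0], [-1.0, 1.0])]
search_vec = [0.9, 0.1]

undirected = top_k(search_vec, products, 1)[0][0]
directed = redirected_search(search_vec, offset_layer, products, k_offsets=1, k_products=1)
print(undirected, directed)
```

In the toy data, the plain search picks `product-b`, while the redirected search follows the stored offset to `product-a`.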

IC_RD · September 20, 2023, 1:54pm

I really like your idea about having private vector offsets for each user!

I previously thought the offset should be added to the document itself so that it could change the vector position of the document to be closer to the query.

I understand using the offset vector to modify the query instead can be more effective since it maintains the document’s original embeddings; however, I just want to make sure I am understanding the middle part where the offset vector is matched with the query.

Is the offset vector stored in the metadata and the vector position is just the original query’s vector? That way the offset vector can always be found by the same query or similar queries?

curt.kennedy · September 20, 2023, 2:03pm

There are two databases, one for queries, and one for products (or target documents).

The query comes in and is correlated against previous queries. For each matching entry you get two vectors: one representing the query itself, and the other representing the offset pointing to the target document. So the query database schema stores both of these vectors for every entry.
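
One way such a two-database layout could be written down is sketched below. The thread only specifies that each query entry carries a query vector and an offset vector; every field name here (`query_text`, `target_doc_id`, and so on) and the helper `make_query_entry` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DocumentEntry:
    """Row in the products/documents database (never modified by feedback)."""
    doc_id: str
    text: str
    vector: List[float]


@dataclass
class QueryEntry:
    """Row in the query database."""
    query_text: str
    query_vector: List[float]    # embedding of the prior query
    offset_vector: List[float]   # target document vector minus query vector
    target_doc_id: str           # hypothetical bookkeeping field


def make_query_entry(query_text, query_vector, document):
    """Build a query-side row that points at `document`."""
    offset = [d - q for d, q in zip(document.vector, query_vector)]
    return QueryEntry(query_text, query_vector, offset, document.doc_id)


doc = DocumentEntry("d1", "Garmin Epix 2 product page", [0.0, 1.0])
entry = make_query_entry("gps watch", [1.0, 0.0], doc)
print(entry)
```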

There is no metadata mixed into the document side from the query side.

If you find no suitable query offset for a new query (the prior queries are not correlated well to the new query), you would then search “undirected” into your products/documents. I would use a combination of dense and sparse vectors (so embeddings and keywords), with a hybrid ranking, to get a suitable match for a new unknown query.

If this new query is then proven to have a suitable match on the document side, you add this information as a new entry in the query database.

So over time, your searches improve, and you have a suitable fallback for the unknown case, which will happen a lot initially.
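
A rough sketch of the fallback and the learning step, under heavy simplifying assumptions: plain keyword overlap stands in for a real sparse-vector score, the equal weighting `alpha=0.5` is arbitrary, and `hybrid_search` and `record_match` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def keyword_overlap(query_text, doc_text):
    """Fraction of query words found in the document (stand-in for a sparse score)."""
    query_words = set(query_text.lower().split())
    doc_words = set(doc_text.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


def hybrid_search(query_text, query_vec, documents, alpha=0.5):
    """Undirected fallback. documents: list of (doc_id, text, vector).
    Returns doc ids, best first, by a weighted dense + keyword score."""
    def score(doc):
        _doc_id, text, vec = doc
        return alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_overlap(query_text, text)
    return [doc[0] for doc in sorted(documents, key=score, reverse=True)]


def record_match(query_db, query_vec, doc_vec):
    """Once a match is confirmed, store (query vector, offset) as a new entry."""
    query_db.append((query_vec, [d - q for d, q in zip(doc_vec, query_vec)]))


documents = [
    ("d1", "garmin gps watch", [0.0, 1.0]),
    ("d2", "coffee maker", [1.0, 0.0]),
]
query_db = []  # starts empty, so early queries take the fallback path

ranked = hybrid_search("gps watch", [0.7, 0.7], documents)
record_match(query_db, [0.7, 0.7], documents[0][2])  # user confirmed d1
print(ranked, query_db)
```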

IC_RD · September 20, 2023, 2:41pm

Okay I see that makes a lot of sense! I reread your previous comment too and it seems like the multimodal case is also handled that way.

Foxalabs · September 20, 2023, 4:17pm

I think you should do this for a living, Curt!

anon10827405 · September 20, 2023, 4:33pm

Yes, I think you’re right that it can be done with your offset system.

I think your suggestion can be considered the initial calculation for a recommendation engine (after all, a centroid can also be considered an offset vector if it is a mean calculation of all the vectors of queries people have matched to a document).

This is a recommendation engine by definition! The main difference is that a centroid calculation can be done in the same database as the products in your use case. The original vector and its centroid are already calculated, so work can be performed on both, as they are not interdependent.

Let’s try a different example: YouTube videos. Highly subjective. An item that gets clicked on based on the search query naturally performs better in that search. This is a centroid that can get mixed with other centroids. So if a user types “Police” they are matched with videos that other people clicked on from the query, and then also sprinkled with the user’s own preferences.

The centroid can also be separated and used as an “Other people watched…” feature. It can be used independently. Each different layer is modular and independent.
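
Mixing these layers could look something like the weighted average below. The weights, the toy vectors, and the function name `blend` are assumptions; the post does not say how the centroids should be combined.

```python
def blend(weighted_vectors):
    """Weighted average of vectors. weighted_vectors: list of (vector, weight)."""
    total_weight = sum(weight for _vec, weight in weighted_vectors)
    dim = len(weighted_vectors[0][0])
    return [
        sum(vec[i] * weight for vec, weight in weighted_vectors) / total_weight
        for i in range(dim)
    ]


# Toy vectors (assumed values).
query_vec = [1.0, 0.0, 0.0]       # embedding of the typed query
crowd_centroid = [0.0, 1.0, 0.0]  # mean of what other people clicked for it
user_centroid = [0.0, 0.0, 1.0]   # mean of this user's own history

# Each layer is independent, so any of them can be dropped or re-weighted.
search_vec = blend([(query_vec, 2), (crowd_centroid, 1), (user_centroid, 1)])
crowd_only = blend([(crowd_centroid, 1)])  # e.g. for an "Other people watched" list
print(search_vec, crowd_only)
```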

@IC_RD Please let us know how this turns out in practice.
