Temporal/Linear Coding with Embeddings?

Has anyone successfully managed to implement some sort of continuous temporal or linear coding with embeddings?


What am I talking about?

Say you want to encode the relative position of something in a book, or in time.

More specifically, for instance, if you had a number line

[..., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,...]

and you did a query on it, you’d want “11.45”, for example, to be closest to 11 and 12; a little further out you’d have 10 and 13, further yet 9 and 14, etc.

However, if you’ve worked with embeddings before, you’ll know that 11.45 is probably going to be closest to 11, 45, 4, 5, and 1, with the remaining numbers splattered all over the place. With te3l (text-embedding-3-large), for example, 11 is closer to 111 than it is to 12 or 10.
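
If you want to reproduce that neighbour check yourself, here’s a minimal sketch, assuming the official OpenAI Python SDK with an API key in the environment (the label range and neighbour count are just illustrative choices):

# Sketch: check which number strings text-embedding-3-large puts nearest to "11.45".
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

labels = [str(n) for n in range(1, 200)] + ["11.45"]
resp = client.embeddings.create(model="text-embedding-3-large", input=labels)
vecs = np.array([d.embedding for d in resp.data])   # returned vectors are unit length

sims = vecs @ vecs[labels.index("11.45")]           # cosine similarity to the query
for i in np.argsort(-sims)[1:6]:                    # top 5 neighbours, skipping the query itself
    print(labels[i], round(float(sims[i]), 3))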

So naive approaches are probably all out.

One glaring issue is that this sort of positional coding seems to require a space with negative curvature, but we’re working on the surface of hyperspheres - positively curved.

One approach that comes to mind could be to select some non-geodesic small circle (r<1) on our hypersphere and rotate any given queryable vector along that circle by an angle theta specified by our linear value.

small circle picture: [Image: Spherical circle - Wikipedia]

The query, however, wouldn’t be transformed the same way - we’d either use the parallel great circle (r=1) or some other small circle that is larger than our queryable circle. If we did this, we’d approximate a locally hyperbolic space.
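
To make that a bit more concrete, here’s a rough sketch of one way to read it: pick two fixed orthonormal directions and rotate each queryable vector within that plane by an angle proportional to its linear value, which drags it along a small circle. The directions, dimensionality and angle scale below are all illustrative assumptions, not a tested recipe.

# Sketch: move a vector along a small circle by rotating it within a fixed
# 2-plane (Givens-style rotation). Components outside the plane are untouched,
# so every unit vector traces a small circle and stays on the unit sphere.
import numpy as np

def rotate_in_plane(v, e1, e2, theta):
    a, b = v @ e1, v @ e2
    v_out = v - a * e1 - b * e2                 # part of v orthogonal to the plane
    return (v_out
            + (a * np.cos(theta) - b * np.sin(theta)) * e1
            + (a * np.sin(theta) + b * np.cos(theta)) * e2)

# toy usage: map a linear value (page, timestamp, ...) to an angle
d = 1536                                        # embedding dimension, arbitrary here
rng = np.random.default_rng(0)
e1 = rng.normal(size=d); e1 /= np.linalg.norm(e1)
e2 = rng.normal(size=d); e2 -= (e2 @ e1) * e1; e2 /= np.linalg.norm(e2)

v = rng.normal(size=d); v /= np.linalg.norm(v)  # stand-in for a queryable embedding
theta_per_unit = 0.05                           # arbitrary scale for the linear value
v_at_t = rotate_in_plane(v, e1, e2, theta_per_unit * 11.45)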

Now the problem here is that you can use at most half the sphere, so you’d need to re-compute every queryable vector’s position relative to the query vector’s theta.

Which means that you might as well just multiply the cosine similarity by cos(delta_theta) (more or less).

But you might recognize this function: we’ve just reinvented the positional encoding from our favourite paper, Attention Is All You Need (https://arxiv.org/pdf/1706.03762).
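
For reference, a sketch of that encoding (d_model and the example positions are arbitrary choices). The dot product of two such vectors is a sum of cos(omega_i * delta_pos) terms, so it depends only on the offset between the positions - the cos(delta_theta) behaviour above:

# Sinusoidal positional encoding from the Transformer paper, as a sketch.
import numpy as np

def sinusoidal_pe(pos, d_model=64):
    freqs = 1.0 / (10000 ** (2 * np.arange(d_model // 2) / d_model))
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

print(sinusoidal_pe(11.45) @ sinusoidal_pe(12.0))    # nearby positions...
print(sinusoidal_pe(11.45) @ sinusoidal_pe(100.0))   # ...score higher than distant ones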

:expressionless:


My question: has anybody tried this or something similar, with any empirical experience?

2 Likes

Yes, I’ve read a few papers that seem to implement this idea. I can only upload some of them here due to format restrictions, but here’s a few:

http://openaccess.thecvf.com/content_cvpr_2017/html/Diba_Deep_Temporal_Linear_CVPR_2017_paper.html

And here’s a search link for even more resources.

1 Like

Sure, but my understanding is that they all (most, at least the LLM ones) concern themselves with bounded ranges in N, whereas this is more about unbounded ranges in R. Although I will admit that I don’t know if it makes a difference.

And I was hoping for success (or failure) stories from the community.

One example that comes to mind is the discussion with @darcschnider about recalling feeding your dog today. If you applied such a transformation, you could maybe identify embeddings representing similar points in time, which would help you differentiate between today and yesterday.

2 Likes

I will also have to admit that I don’t know if it makes a difference. In this case, I would create prototypes using different methods, one for N and one for R, and measure nearest-neighbor accuracy to figure out if there’s any difference between them :thinking:
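
Something like this, as a rough sketch, where encode is a placeholder for whichever prototype is being tested and the similarity is just a dot product:

# Sketch: nearest-neighbour accuracy for a position encoder. "encode" is a
# hypothetical placeholder for an N- or R-based prototype.
import numpy as np

def nn_accuracy(encode, positions, queries):
    positions = np.asarray(positions, dtype=float)
    keys = np.stack([encode(p) for p in positions])
    hits = 0
    for q in queries:
        true_nearest = positions[np.argmin(np.abs(positions - q))]
        retrieved = positions[int(np.argmax(keys @ encode(q)))]   # dot product as similarity
        hits += retrieved == true_nearest
    return hits / len(queries)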

1 Like

We’ve developed a sophisticated system for handling temporal and semantic information in our Kruel.AI, leveraging several techniques to ensure the most relevant and predictive data is pulled.

Our approach involves:

  1. Cosine Similarity Over Multiple Data Points:
  • We use cosine similarity extensively to measure the semantic similarity between different data points. This helps in identifying and ranking the most contextually relevant memories or information pieces based on their vector representations, allowing for a more dimensional graph.
  2. Temporal Encodings:
  • Temporal aspects are encoded alongside semantic embeddings, ensuring that the recency of information is taken into account. This is akin to combining cosine similarity with a temporal decay function to prioritize more recent data without discarding valuable older information.
  3. Clustering Techniques:
  • Clustering methods are employed to group similar data points together. This helps in reducing noise and improving the efficiency of data retrieval. By clustering, we can manage and retrieve large volumes of information more effectively.
  4. Neural Network Relationship System:
  • Our system uses a neural network-based relationship mechanism to understand and map the relationships between different pieces of information. This network helps in dynamically adjusting the relevance scores based on the evolving context and data patterns.
  5. Hybrid Scoring System:
  • We implement a hybrid scoring system that combines the cosine similarity score with a temporal scoring function. This ensures that both the semantic relevance and the temporal proximity are considered in a balanced manner.

These methods along with some others collectively allow our system to retrieve the most contextually appropriate and temporally relevant data, making it highly effective in dynamic and real-time applications.
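
For a rough sense of what a hybrid score like point 5 can look like, here’s a sketch only - the weights and half-life are illustrative placeholders, not production values:

# Illustrative sketch of a hybrid score: semantic similarity blended with an
# exponential recency term. Weights and half-life are made-up placeholders.
def hybrid_score(cos_sim, age_seconds, w_semantic=0.7, w_temporal=0.3,
                 half_life_s=7 * 24 * 3600):
    recency = 0.5 ** (age_seconds / half_life_s)   # 1.0 for brand-new, 0.5 after one half-life
    return w_semantic * cos_sim + w_temporal * recency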

I found several of the ideas discussed here very intriguing and believe they could further enhance our system. The concepts around more sophisticated temporal encodings and adaptive nearest-neighbor algorithms, in particular, offer promising avenues for expanding our current capabilities. It’s always exciting to see innovative approaches. Does make me cry a little though haha, every new idea means more coding… I think AI is a life project haha.

1 Like

Have you thought about just time stamping the embedding, as an additional metadata field?

You can mix this into retrieval by using something like RRF, where you mix time and correlation into your overall ranking.

Same sort of thing for positional data, but here you have the positional offset, which can be mixed in as well.

I had more details on “aging” the timestamps as well, to declutter things and boost performance.

So the application would be news stories, where newer is better, and you want good separation with RRF, and so you demote (age) the timestamp, on a schedule, by zero filling the UNIX timestamp from the right.

Also with the repetitive “alarm” scenario of “feed your dog at 8am”: the aging process puts all the information in the same time bucket over time, and you can detect common/repetitive events, and treat them differently if you wish. But the detection/clustering rate is determined by your aging schedule.

So you quantize time over time, similar to how our brains work.
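
In code, that right-to-left zeroing is just something like this (sketch; the digit counts in the schedule are whatever you pick):

# Sketch: age a UNIX timestamp by zeroing digits from the right, so older
# items collapse into progressively coarser time buckets.
def age_timestamp(ts: int, zeroed_digits: int) -> int:
    scale = 10 ** zeroed_digits
    return (ts // scale) * scale

# e.g. 1718900123 -> 1718900000 (3 digits zeroed, ~17-minute buckets)
#      1718900123 -> 1718000000 (6 digits zeroed, ~11.6-day buckets)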

3 Likes

Indeed. I’m not a fan of RRF though at the moment, because it doesn’t make any intuitive sense to me yet.

Something like a cos window applied to the timestamp delta, with some width c, would represent what you’re suggesting. But instead of cos, a Gaussian function might make more sense. We’d apply that to the cosine similarity.

With an infinite c we’d have the regular (360°, 0s) cosine similarity, and as it shrinks we increase the focal length in time until we have a macro shot of the last embedding.

We could have some precalculated Cs and Bs depending on the questions we’re asking (yesterday, an hour ago, two pages ago, last year, last chapter), but I’m wondering if this could be done dynamically too.
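
As a sketch, it could look like this - treating b as the window centre and c as its width is my assumption, and the numbers (in days) are placeholders:

# Sketch: precalculated Gaussian time windows per question scope.
import numpy as np

PRESETS = {                      # scope: (b_days, c_days) - placeholder values
    "an hour ago": (1 / 24, 0.05),
    "yesterday":   (1.0, 0.5),
    "last year":   (365.0, 90.0),
}

def scoped_similarity(cos_sim, delta_t_days, scope):
    b, c = PRESETS[scope]
    return cos_sim * np.exp(-((delta_t_days - b) ** 2) / c ** 2)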

If you’re using an adaptive nearest neighbor algorithm, you could use information about the neighborhood to inform C. maybe.

I don’t like quantizing continuous stuff at the data level because if you do that, you’re introducing arbitrary scope constraints :grimacing:

Of course, if you just need a quick and dirty solution and you know that your questions will always be about whether you did it yesterday or today, then that might absolutely make sense.

2 Likes

RRF is a bit odd, but it only applies to ranking real or integer valued things together with the kernel 1/x, which has a long tail.

I use the exponential weighting to distinguish things that are really close together in the cosine similarity sense.

To apply “RRF” in this exponential fashion, if it makes more sense to you, then switch the kernel to exp(-x^2), or some variant thereof.

This kernel has a sharper impulse response, and is more selective than 1/x.

But when combining embeddings, I don’t use RRF or RSF anymore, I combine them coherently by taking a weighted average of cosine similarities. The RRF or RSF are non-coherent techniques, and apply to rankings mostly, from what I can tell.
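
i.e. roughly this, as a sketch (the per-signal weights are whatever you tune):

# Sketch: "coherent" combination - a weighted average of cosine similarities
# per signal, rather than fusing ranked lists after the fact.
import numpy as np

def combined_similarity(sims_per_signal, weights):
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * np.asarray(s, dtype=float) for wi, s in zip(w, sims_per_signal))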

This is an interesting approach. I would be curious to see if it works out and you find a working algorithm that determines the coefficients.

You could also just introduce a score or value based on the actual age without quantization. Quantization was just easy to compute and handle, without taking distances. :rofl: But the continuous scale would lend itself to more information, which usually is a good thing. :sunglasses:

3 Likes

Go write your own RRF function, it will make sense quite quickly, and it’s a really good thing to have in your toolbox :hugs:

Trust me it will help :heart:

1 Like

Using adaptive nearest neighbor, exponential time falloff:

hmm, interesting, maybe. Stepping through the graphs:

  1. All items are almost orthogonal to each other. To us, it looks like we’re looking at a single very sparse cluster.
  2. Time is still overrepresented, but we can see more things coming into view.
  3. Max the dog is starting to pop out of the cluster.
  4. As soon as Max popped out of the cluster, he became the most likely candidate as well, overtaking Whiskers.
  5. Another item pops out of the cluster, but doesn’t overtake our result on emergence.
  6. Whiskers is starting to sod off completely.
  7. (3 more steps)



This is just two categories, but a potential algorithm could be:

  1. find the boundary between n_proximal = 1 and n_proximal = 2 (can we set up newton/gradient descent for that?)
  2. the results at the boundary should be able to answer your question. maybe.

(work in progress, obviously)

Proximity function (note: theta here plays the role of c^2; the values shown in the graphs are actually sqrt(theta), i.e. c):

import numpy as np
from sentence_transformers import util

def compute_cosine_similarity(root_vector, vector, theta):
    """Eroded cosine similarity: semantic similarity damped by a Gaussian in the
    time delta. Both arguments are dicts with a 'vector' and a millisecond
    'timestamp'; theta controls the width of the time window."""
    cosine_similarity = util.cos_sim(root_vector['vector'], vector['vector'])[0].item()

    # theta factor: delta_t in days, Gaussian falloff with width theta
    delta_t = np.abs(root_vector["timestamp"] - vector["timestamp"]) / (1000 * 60 * 60 * 24)
    return cosine_similarity * np.exp(-(delta_t ** 2) / theta)
1 Like

My understanding is that it’s just

def rrf(rank_lists, k=60):                 # k = "some const", see below
    scores = {}
    for rank_list in rank_lists:
        for rank, item in enumerate(rank_list, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

I’m saying that I don’t know what this means. I don’t know where k comes from (other than feelsgoodman), and I don’t like how each ranking algorithm is assumed to be equally performant. :confused:

1 Like

You can generalize this to a weighted RRF by changing the numerator 1 to some other weight. This would then factor in the quality of each ranking, but unfortunately, it won’t factor in the density of how close or far apart things are ranked.

The benefit, though, is you can combine dissimilar things based on rankings alone. Think of combining keywords and semantics into your final ranking.

Also think of the constant k as a way to boost or decrease the rank for a particular stream. But realize that for k large, you are in the linear portion of 1/x, and for k small you are more non-linear. So small values of k can lead to more expressiveness, and obviously higher scores. And large values lead to smaller scores, and more predictable contributions.
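
A sketch of that weighted variant, with the per-stream weight in the numerator and a per-stream k (the values you’d plug in are up to you):

# Sketch: weighted RRF - a per-stream weight in the numerator and a per-stream
# k controlling how non-linear that stream's 1/x contribution is.
def weighted_rrf(rank_lists, weights, ks):
    scores = {}
    for rank_list, w, k in zip(rank_lists, weights, ks):
        for rank, item in enumerate(rank_list, start=1):
            scores[item] = scores.get(item, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)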

But like @N2U mentioned, I would at least be aware of RRF, since it’s fairly easy to implement, and gives reasonable results, and allows you to fuse rankings across disparate domains, like embedding vectors and keywords.

Note: Above, timestamps and cosine similarity rankings are disparate, so I would look at RRF as a rank fusing technique.

1 Like

Of course. :frowning:

But I guess that’s my problem. To me it feels like bashing rocks together :confused:


just restating what you said.

As you say, we’re quantizing multiple metrics by order, so any information about the topologies of these metrics gets erased. We’re assuming that the topologies are flat, but with LLMs, we know that we’re operating on spheres. Any information about clusters or neighborhoods gets canned. Then we apply the 1/(k + rank) kernel on top.

I mean, fair enough. I personally don’t see it, but I could be convinced that there could be a foundation there. The reciprocal rule is often found in nature, but I’m struggling to find the relation.


I don’t deny that it’s a quick and dirty way of merging random ranking algos.

But you sacrifice all explainability, especially if you decide to rerank in multiple stages. I’d only consider it as a very last resort :frowning:

1 Like

It’s the same as the harmonic mean, used in a bunch of places, like physics, math, engineering. It’s also related to the natural logarithm, which shows up everywhere, including information theory, which I think is why it is a strong candidate in this context.

I’m intrigued, do you have an example of this sort of multiple stage re-rank using some other algorithm?

1 Like

Well, in the example above we just inserted an exponential time dimension that could be used as a ranker.

We have:

  1. cosine similarity
  2. neighbor occlusion
  3. exp delta t

but instead of ranking with arbitrary parameters, and then reranking, we can explore the combined space to see what it means.

Right now I don’t know if it’s still differentiable, but it could be. And if it is, we can potentially find an objective optimum for our parameters.

1 Like

Not sure what you mean by “neighbor occlusion”, but the other two are either smooth or continuous, so there is a good chance that derivatives exist.

I don’t think the model internals are on the hypersphere, since this is arbitrarily scaled for us so we can do simple math for cosine similarities.

But the bulk of the model information is, since phase is more important than amplitude, and so the amplitude can be discarded and set to 1, without much loss of information.

But agree the RRF is somewhat “unexplainable”. That’s why it’s not my first choice either. :rofl:

1 Like

I need to think of a better name. I’m thinking that just because a neighbor is close to you doesn’t mean that this neighbor is super salient. So I’m building a sort of occlusion map like you have in 3D games - we don’t really care about information that’s inside or behind some object. If we did, we’d be inside or behind that object.

2 Likes

This sounds like reinventing the wheel, or rather the Hierarchical Navigable Small World (HNSW) algorithm. If you look at the HNSW layers and imagine them as representing traversal across decreasing levels of occlusion, you may understand why I’m making this analogy.

My suggestion here is to drop the neighbor-occlusion bit entirely, as the results are usually very close to just using cosine similarity.

This is the reason why the RRF function feels like banging rocks together.

The standard RRF function doesn’t account for the individual scores, just the rank, but it doesn’t have to be like that: just normalize your scores and use those instead of the rank itself, and you’ll retain all of this information through to the end :heart:
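
Something like this, as a sketch (min-max normalization is just one option):

# Sketch: fuse normalized scores instead of ranks, so the relative distances
# between candidates survive the fusion step.
import numpy as np

def normalized_score_fusion(score_lists, weights):
    fused = np.zeros(len(score_lists[0]), dtype=float)
    for scores, w in zip(score_lists, weights):
        s = np.asarray(scores, dtype=float)
        s = (s - s.min()) / (s.max() - s.min() + 1e-12)   # normalize to [0, 1]
        fused += w * s
    return fused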

1 Like

That is exactly where it comes from :rofl:

K is just a number you YOLO: smaller K values make the system more sensitive to changes in rank position, larger values of K reduce the differences between adjacent ranks.

2 Likes

It’s possible that HNSW is the same thing, but I don’t trust it. Most trees should be able to guarantee an optimal solution - HNSW doesn’t, and I don’t understand why. (I mean, I understand why for HNSW; I just don’t understand how it can be trusted to deliver the results you need apart from using large parameters, and even then it could potentially be a crap shoot.)

Very well could be, but I’m reinventing everything I don’t understand until I do :stuck_out_tongue:

My qualm with HNSW is that it looks like it was invented for flat spaces. I’m not sure how well it generalizes for spheres.

Don’t get me wrong, I’m still using it, but the occlusion thing is more of a method to eliminate duplicate results.

1 Like