Temporal/Linear Coding with Embeddings?

Lmao, if you think about it, the entire goal of science is the discovery of knowledge, but that’s the end result; the starting point is basically always filled with knowledge gaps that you’ll have to yolo :rofl:

I 100% approve of this, building it yourself is the best way to fully confirm that you actually understand what you’re doing :hugs:

The ground beneath your feet is flat, but the earth is round :thinking:

That is exactly what RSF is :rofl:

So RRF is rankings only, and the spinoff algorithm, RSF, ranks with respect to the densities of each stream. So if you have the metrics behind the rankings, you could use a normalized version of those, instead of the actual rankings.
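Something like this, if I understand it right; a minimal sketch where the min-max normalization and the k=60 constant are common defaults I’m assuming, not necessarily what RSF specifies exactly:

```python
# RRF fuses rankings only; the RSF-style variant fuses min-max
# normalized raw scores from each stream instead.
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    fused: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused

def rsf(streams: list[dict[str, float]]) -> dict[str, float]:
    fused: dict[str, float] = {}
    for scores in streams:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against a constant stream
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + (s - lo) / span
    return fused
```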

Maybe RSF is more intuitive? There is no inverse ranking nonsense like in RRF. @Diet ???

I think I understand what you mean.

For example, if you are using cosine similarity for semantics, you get your local neighborhood of similar things (semantically). But then you use an exponential weighting on some other dimension, like time, to spread them apart, or “occlude” the close neighbors, to bring forth only the time-appropriate things.
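A minimal sketch of that occlusion weighting, assuming a decaying exp(-Δt/c²) kernel (the sign and exact form are my guess from the c discussion further down):

```python
import numpy as np

def time_weighted(cosims: np.ndarray, ages: np.ndarray, c: float) -> np.ndarray:
    # damp each semantic neighbor by how old it is; small c occludes
    # everything but the most recent, large c erases the time axis
    return cosims * np.exp(-ages / c**2)
```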

ADA was flat (-ish sorry ADA, you were still our first love :laughing: )

te3 and most “smart” (very?) large embedding models are very curvy.

Yeah but I think it’s a good idea to ground our ivory towers as much as possible as opposed to yeeting yoloed yolos to see what sticks :rofl:

The worst thing that can happen is that I embarrass myself again :rofl:

Yep :rofl:

I’m actually using this, with the only extra thing being a “method weight”. I’m fairly sure this is pretty vanilla, but it seems to do the job very well.
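If “method weight” just means a per-stream multiplier on the normalized scores (my reading), it would look roughly like the rsf sketch above with one extra factor:

```python
def weighted_rsf(streams: list[dict[str, float]],
                 weights: list[float]) -> dict[str, float]:
    fused: dict[str, float] = {}
    for scores, w in zip(streams, weights):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        for doc, s in scores.items():
            # identical to rsf above, scaled by the method weight w
            fused[doc] = fused.get(doc, 0.0) + w * (s - lo) / span
    return fused
```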

Fair point, but just remember that between the yolo’ing and the end result, there’s a huge amount of benchmarks and testing to be done :rofl:

In 2d:

You’d only get A, D, and H as search results; everything else is occluded/overshadowed.

Note for example that D is further away than B, but is still a result while B isn’t.

This is independent of the time aspect.

If you add an extra dimension and in this case tweak c (in the case of exp(-Δt/c²)), you sort of rotate these relationships. Like the clusters start spinning about some axis (I think orthogonal to time?)

Anyways, a bunch of stuff starts to rotate and you’d need to recompute the occlusion thing; you’d get a new map.

If you lower c to approach 0, you sort of uncurl the entire space, and all the clusters seem to turn into a very sparse line as far as I can tell.

OK, I see what you are shooting for, but I am confused on how you are rotating this hyper-dimensional neighborhood with a scalar. Shouldn’t you be using a proper rotation matrix?

And what dictates the rotation angle?

So you would have to find the axis of rotation, and how much to rotate, then apply a matrix transformation, then correlate with the dot product of these rotated vectors. And you need to do it with all these neighborhoods, since each patch has an independent rotation axis and rotation angle.

:thinking:

I think it’s just a perspective thing

Trying to derive an answer for you

So the thing is, you’re asking questions I don’t have the answer to yet :rofl:

The following is me trying to solve this in real time :laughing:

I’m just reporting what I’m observing here. So it’s quite possible that what I’m describing as a rotation is just the result of a translation on the hypersphere. If you just look at the relative relationships on the periphery as far as you can see them from your PoV (your query vector), things seem to start moving in weird circular ways when you take a step in any direction.

I guess it’s sort of like looking through a fisheye lens?

https://giphy.com/clips/storyful-australia-building-and-new-zealand-URkAVz1p99QxSooLsp

But the bigger issue here is that almost all structures are very curled up around each other. Since we have 3000 curved dimensions and one linear time dimension, we have a sort of high-dimensional cylinder. Any chain of events would be some sort of hyperhelix.

But the time axis is very sparse.

But we know that the perception or importance of time is relative. So we locally compact and distally sparsify t. And c just tells us what portion we compact, and by how much.

Why do things rotate when we adjust c?

It might be easier to visualize with Heaviside step functions. Imagine that you have a helix:


[image, src: Helix -- from Wolfram MathWorld]

If you sliced it linearly, your features would appear to be spinning :thinking:

The exp function dictates what slice of your hyperhelices (features on a 3000-sphere base + flat time dimension) you’re looking at. So changing c (the width) of that slice spins stuff. With infinite c you basically compact everything and effectively erase the time axis.

which this pic shows: [image]

time axis gone

Yeah, it’s an improvement, I guess.

the problem I foresee here with a linear axis is that an outlier will absolutely wreck your shtuff.

side note: most of that reranking stuff seems to be intended for augmenting keyword search with weak embeddings, for which it works decently well. I’m thinking that we won’t need that stuff anymore with stronger embedding models - I’m wondering what a llama-3-70b-embedding-instruct could do.

The exp function acts as a 2-d rotation in the complex plane, which is what the “Attention Is All You Need” paper is using.

The paper is using the real and imaginary components of exp(-ix).
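For reference, a minimal sketch of that paper’s sinusoidal encoding; each sin/cos column pair is the imaginary/real part of exp(i·pos/10000^(2k/d)):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # d_model assumed even; each (sin, cos) column pair is one complex
    # rotation exp(i * pos / 10000**(2k / d_model))
    pos = np.arange(seq_len)[:, None]
    k = np.arange(d_model // 2)[None, :]
    angle = pos / 10000 ** (2 * k / d_model)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # imaginary part
    pe[:, 1::2] = np.cos(angle)  # real part
    return pe
```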

But I’m not seeing how you are doing this in higher dimensions, since you would project the vectors into the complex plane, rotate, then invert back to higher dimensions and correlate.

So multiplying by a real-valued scalar won’t do this. And a complex-valued scalar will rotate, but only in 2d, where the 2d is interpreted as complex-valued. Again, not seeing it, or how it extends to N dimensions with the code you have.

What line of code is projecting N dimensions to 2?

There is no rotation in the complex plane here. There is no imaginary term. Instead of exp, you could use a bunch of other filter functions, like -cosh+2 or cos or something.

I think the rotations are just perspective artifacts.

the query is (0,0). the closest neighbor falls on r = cosim, theta = 0. the second closest neighbor falls on r = cosim, theta = cosim(1st neighbor). I’m just projecting as many triangular relations as I can.* (*theta for first order neighbors gets squished and rescaled if there are too many of them)
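As a sketch, here is one literal reading of that layout (the cumulative theta stepping and the units are my interpretation; the squish/rescale footnote is omitted):

```python
def polar_layout(query_cosims: list[float],
                 pairwise_cosims: list[list[float]]) -> list[tuple[float, float]]:
    # query_cosims[i]: cosim(query, neighbor_i), sorted descending
    # pairwise_cosims[i][j]: cosim(neighbor_i, neighbor_j)
    theta = 0.0
    points = [(query_cosims[0], theta)]  # closest neighbor pegged to 0 degrees
    for i in range(1, len(query_cosims)):
        theta += pairwise_cosims[i - 1][i]  # spacing from neighbor-to-neighbor cosim
        points.append((query_cosims[i], theta))
    return points
```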

edit: I’ll define a rotation as such: assume you have three points: S, A, and B. a transformation that causes the distance SA to increase and the distance SB to decrease would be a rotation.

here this just happens because c (in the gaussian form of the exp function) alters the relation of the angular component to the linear component by locally different degrees.

OK, so it looks like you are using polar coordinates to visualize the various cosine similarities.

But what does this perspective provide?

I mean, I like it, it’s cool, but the angle is just the cosine similarity.

So you expect small arcs for tightly correlated things, and bigger arcs for uncorrelated and spread out things.

But the rotation acts as a bias to the cosine similarity. What’s good about this bias? Especially when it aliases back every 360 degrees :rofl:

Aliasing kills everything, unless you use it cutely :upside_down_face:

hmm, it’s possible that this view is not super intuitive.

the radius indicates cosine similarity to your query vector. this is always accurate.

the angle is more for second order neighbors. it tells you how far they are away from your root neighbor.

root neighbors are angularly spaced by cosine similarity between each other if there’s enough space. if not, theta gets scaled down so they all fit.

It can’t really; the max is +90 degrees for second-order neighbors, and their angular position is only relative to their parent.

It lets you know how your proximal points relate to you, and to a degree, to each other.

It’s obviously a wip, but it originated from me not really being able to get force directed graphs working in any meaningful fashion, and other dimensionality reduction approaches (incl SNE) not really producing any meaningfully interpretable results :confused:

I get it. It’s certainly a compact non-linear way of visualizing cosine similarities!

And it won’t truly alias because the radius is continually dropping off; you are correct.

The rotation of the spiral acts as a DC bias to cosine similarity.

This is actually good, IMO, for biasing the strength of one model over another.

What’s weird though, and maybe what you are thinking, is if you treat these as complex numbers, and add them across different models?

Each model will have its own bias (rotation) and spread (arc length), relative to the other models. Wondering if this is the direction you were thinking.

However, what I do is note the spread and bias of each model and add the cosine similarities, which is a coherent operation. It’s better than RRF or RSF, since it synthesizes a giant embedding model, like the high dimensions of the original DaVinci embedding model, and contains more detail than any single model.
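A minimal sketch of that, assuming “note” just means recording each model’s bias/spread while the fusion itself is a plain sum of cosims:

```python
import numpy as np

def coherent_fusion(cosims_per_model: dict[str, np.ndarray]) -> np.ndarray:
    # the fusion is a plain coherent sum of per-model cosine similarities;
    # bias (mean) and spread (std) are only logged for interpretability
    for name, c in cosims_per_model.items():
        print(f"{name}: bias={c.mean():.3f}, spread={c.std():.3f}")
    return np.sum(list(cosims_per_model.values()), axis=0)
```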

You could try to formalize a calculus on these arcs to synthesize some gain, but I find it more intuitive to combine them in the most direct and canonical way, and note the individual model characteristics, so I can understand the detailed DaVinci-sized monster I just created :rofl:

Rotation between the results, yeah. The closest vector is always pegged to 0°.

te3l vs mistral

(pulled out of the private channel)
this is te3l (this still had linear radial scaling): [image]

this is sfr-mistral: [image]

I actually initially just wanted to test and visually debug instruct embeddings. Am I hitting a bullseye, is my prompt garbo, or am I missing data?

:thinking:

Doing that, you’re effectively normalizing each component by 1/dim.

I guess we could visually inspect whether you’re actually gaining or erasing resolution. If (in the te3l vs mistral example) we called the difference between true and false the “confidence” band, I’d expect your addition approach to just give you the average of the two. :thinking: :thinking: :thinking: :thinking:

dumb idea: generate complement embedding gaussians for confidence amplification in model fusion (rough code sketch after the steps):
  1. take your query vector q
  2. construct a complement q’ to your query (e.g.: true => false, did I feed my dog? => did I starve my dog?)
  3. for each model m, element e:
    3.1. conf_m_e = (cosim_m(q, e) - cosim_m(q’, e))/2. note, sign is preserved
    3.2. mean_m_e = (cosim_m(q, e) + cosim_m(q’, e))/2
  4. for each element e:
    4.1. conf_e = sqrt(sum_m(conf_m_e * |conf_m_e|)). variance, but preserve sign. if sqrt imaginary, multiply by i
    4.2. cosim_e = sum_m(mean_m_e / count_m) + conf_e
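Here is a rough numpy sketch of steps 1-4 (the sign-preserving sqrt folds the “multiply by i” rule back onto the real line; the cosim inputs are assumed precomputed):

```python
import numpy as np

def signed_sqrt(x: np.ndarray) -> np.ndarray:
    # step 4.1's sqrt: preserve sign instead of going imaginary
    return np.sign(x) * np.sqrt(np.abs(x))

def complement_fusion(cos_q: list[np.ndarray],
                      cos_qc: list[np.ndarray]) -> np.ndarray:
    # cos_q[m][e] = cosim_m(q, e); cos_qc[m][e] = cosim_m(q', e)
    conf = [(a - b) / 2 for a, b in zip(cos_q, cos_qc)]      # step 3.1
    mean = [(a + b) / 2 for a, b in zip(cos_q, cos_qc)]      # step 3.2
    conf_e = signed_sqrt(sum(c * np.abs(c) for c in conf))   # step 4.1
    return sum(mean) / len(mean) + conf_e                    # step 4.2
```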

:thinking:

might need to work on some kinks, but the idea is that you’re stacking gaussians. agreeing models will improve the score of an element, while disagreeing models will push the element away.




[image]
(it’s not a sigmoid, it’s actually exp(-c/log(x)))

old data

I’m dropping this here for now, explain later.

hypothesis tested: highest ascent velocity implies maximum salience

(graph is inverted, 90° is cosim 0)

I forgot to ask: why are you embedding a formulaic pattern over and over? The pattern is “fed Whiskers on {Date}”.

The embeddings are just discerning semantic differences in Date, which is an ill-posed problem for a semantics engine, right?

There is no one-size-fits-all solution, but why not try to use a regex to extract everything with a date and replace the date string with {Date}?

This way you get the constant string “fed Whiskers on {Date}”, and therefore a static embedding vector. Now you don’t need to worry about these minor variations in semantics, since they have been flattened by your regex prefilter.
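A minimal sketch of that prefilter; the date pattern here is a toy stand-in, and real inputs would need a broader pattern or a date-parsing library:

```python
import re

# toy pattern: ISO dates, slashed dates, and "Month D[, YYYY]"
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}"
    r"|\d{1,2}/\d{1,2}/\d{2,4}"
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}(?:, \d{4})?)\b"
)

def flatten_dates(text: str) -> str:
    # collapse every date string to the literal token {Date}
    return DATE_RE.sub("{Date}", text)

assert flatten_dates("fed Whiskers on 2024-05-01") == "fed Whiskers on {Date}"
```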

Ah, those are just labels for the graph

The actual text is something like

“Well today’s Monday, I hate Mondays but at least max is there to keep me company. I guess we still have some leftover steak and I’m sure that max would absolutely go crazy for it.”

I’ll post the data when I get back to the office