Lmao, if you think about it, the entire goal of science is the discovery of knowledge, but that's the end result; the starting point is basically always filled with knowledge gaps that you'll have to yolo
I 100% approve of this; building it yourself is the best way to fully confirm that you actually understand what you're doing
The ground beneath your feet is flat, but the earth is round
That is exactly what RSF is
So RRF uses rankings only, and the spinoff algorithm, RSF, ranks with respect to the score densities of each stream. So if you have the metrics behind the rankings, you could use a normalized version of those instead of the actual rankings.
Maybe RSF is more intuitive? There's no reciprocal-rank nonsense like in RRF. @Diet ???
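For concreteness, a minimal sketch of the two fusion schemes as I read them (the function names and the k = 60 smoothing constant are illustrative, not from any specific library):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: combine result lists using ranks only."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rsf(score_maps):
    """Relative Score Fusion: min-max normalize each stream's raw
    scores, then sum, so the score *density* within a stream matters,
    not just the ordering."""
    fused = {}
    for stream in score_maps:
        lo, hi = min(stream.values()), max(stream.values())
        span = (hi - lo) or 1.0
        for doc, s in stream.items():
            fused[doc] = fused.get(doc, 0.0) + (s - lo) / span
    return sorted(fused, key=fused.get, reverse=True)
```

Note the difference in inputs: RRF only needs ordered lists, while RSF needs the raw scores behind them.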
I think I understand what you mean.
For example, if you are using cosine similarity for semantics, you get your local neighborhood of semantically similar things. But then you use an exponential weighting on some other dimension, like time, to spread them apart, or "occlude" the close neighbors, to bring forth only the time-appropriate things.
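A minimal sketch of that blend, assuming a Gaussian window in time (the function name, the multiplicative combination, and the default c are my choices):

```python
import math

def time_weighted_score(cosine_sim, delta_t, c=7.0):
    """Semantic similarity damped by a Gaussian window in elapsed
    time, so semantically close but temporally distant neighbors get
    occluded. c controls the width of the time window."""
    return cosine_sim * math.exp(-(delta_t / c) ** 2)

time_weighted_score(0.9, 0)    # recent: full similarity, 0.9
time_weighted_score(0.9, 30)   # far out in time: pushed toward 0
```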
ADA was flat(-ish, sorry ADA, you were still our first love)
te3 and most "smart" (very?) large embedding models are very curvy.
Yeah, but I think it's a good idea to ground our ivory towers as much as possible, as opposed to yeeting yoloed yolos to see what sticks
The worst thing that can happen is that I embarrass myself again
Yep
I'm actually using this, with the only extra thing being a "method weight". I'm fairly sure this is pretty vanilla, but it seems to do the job very well.
Fair point, but just remember that between the yolo'ing and the end result, there's a huge amount of benchmarking and testing to be done
In 2d:
You'd only get A, D, and H as search results; everything else is occluded/overshadowed.
Note for example that D is further away than B, but is still a result while B isn't.
This is independent of the time aspect.
If you add an extra dimension and in this case tweak c (in the exp(Δt/c²) weighting), you sort of rotate these relationships. Like the clusters start spinning about some axis (I think orthogonal to time?).
anyways, a bunch of stuff starts to rotate and you'd need to recompute the occlusion thing; you'd get a new map.
If you lower c toward 0, you sort of uncurl the entire space, and all the clusters seem to turn into a very sparse line, as far as I can tell.
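Reading the weight as the Gaussian exp(-(Δt/c)²) (the exact form of the exponent is my assumption), the two limits of c fall out directly:

```python
import math

def time_weight(delta_t, c):
    # Gaussian time window: exp(-(Δt/c)^2)
    return math.exp(-(delta_t / c) ** 2)

# c -> 0: any nonzero Δt is crushed to 0, so only (near-)simultaneous
# neighbors survive; c -> infinity: every Δt gets weight ~1, which
# effectively erases the time axis altogether.
small_c = [round(time_weight(dt, 0.1), 3) for dt in (0, 5, 50)]  # [1.0, 0.0, 0.0]
large_c = [round(time_weight(dt, 1e4), 3) for dt in (0, 5, 50)]  # [1.0, 1.0, 1.0]
```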
OK, I see what you are shooting for, but I am confused about how you are rotating this hyper-dimensional neighborhood with a scalar. Shouldn't you be using a proper rotation matrix?
And what dictates the rotation angle?
So you would have to find the axis of rotation, and how much to rotate, then apply a matrix transformation, then correlate with the dot product of these rotated vectors. And you need to do it with all these neighborhoods, since each patch has an independent rotation axis and rotation angle.
I think it's just a perspective thing
Trying to derive an answer for you
So the thing is, you're asking questions I don't have the answer to yet
The following is me trying to solve this in real time
I'm just reporting what I'm observing here. So it's quite possible that what I'm describing as a rotation is just the result of a translation on the hypersphere. If you just look at the relative relationships on the periphery as far as you can see them from your PoV (your query vector), things seem to start moving in weird circular ways when you take a step in any direction.
I guess it's sort of like looking through a fisheye lens?
https://giphy.com/clips/storyful-australia-building-and-new-zealand-URkAVz1p99QxSooLsp
But the bigger issue here is that almost all structures are very curled up around each other. since we have 3000 curved dimensions and one linear time dimension, we have a sort of high-dimensional cylinder. Any chain of events would be some sort of hyperhelix.
But the time axis is very sparse.
But we know that the perception or importance of time is relative, so we locally compact and distally sparsify t. And c just tells us what portion we compact, and by how much.
Why do things rotate when we adjust c?
It might be easier to visualize with Heaviside step functions. Imagine that you have a helix:
src: Helix -- from Wolfram MathWorld
if you sliced it linearly, your features would appear to be spinning
The exp function dictates what slice of your hyperhelices (features on a 3000-sphere base + flat time dimension) you're looking at. So changing c (the width of that slice) spins stuff. With infinite c you basically compact everything and effectively erase the time axis.
which this pic shows
time axis gone
Yeah, it's an improvement, I guess.
the problem I foresee here with a linear axis is that an outlier will absolutely wreck your shtuff.
side note: most of that reranking stuff seems to be intended for augmenting keyword search with weak embeddings, for which it works decently well. I'm thinking that we won't need that stuff anymore with stronger embedding models - I'm wondering what a llama-3-70b-embedding-instruct could do.
The exp function acts as a 2-d rotation in the complex plane, which is what the "Attention Is All You Need" paper is using.
The paper is using the real and imaginary components of
exp(-ix)
But Iâm not seeing how you are doing this in higher dimensions, since you would project the vectors into the complex plane, rotate, then invert back to higher dimensions and correlate.
So multiplying by a real-valued scalar won't do this. And a complex-valued scalar will rotate, but only in 2d, where the 2d space is interpreted as complex-valued. Again, not seeing it, or how it extends to N dimensions with the code you have.
What line of code is projecting N dimensions to 2?
There is no rotation in the complex plane here. There is no imaginary term. Instead of exp, you could use a bunch of other filter functions, like -cosh+2 or cos or something.
I think the rotations are just perspective artifacts.
the query is at (0,0). the closest neighbor falls on r = cosim, theta = 0. the second-closest neighbor falls on r = cosim, theta = cosim to the 1st neighbor. I'm just projecting as many triangular relations as I can.* (*theta for first-order neighbors gets squished and rescaled if there are too many of them)
edit: I'll define a rotation as such: assume you have three points, S, A, and B. A transformation that causes the distance SA to increase and the distance SB to decrease would be a rotation.
here this just happens because c (in the Gaussian form of the exp function) alters the relation of the angular component to the linear component to locally different degrees.
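Here's a rough, self-contained sketch of that layout as I read it (radius = cosine similarity per the description; spreading consecutive roots by their mutual cosine *distance* and rescaling them into a half turn are my assumptions, not the actual code):

```python
import math

def polar_layout(query, roots, children):
    """Sketch of the polar neighborhood view. All vectors are
    unit-length lists of floats.
    - radius: cosine similarity to the query (closest neighbor sits
      at the largest radius, pegged to angle 0)
    - consecutive root neighbors are spread by their mutual cosine
      distance, rescaled so all roots fit within a half turn
    - a child's angle is an offset relative to its parent root"""
    cos = lambda a, b: sum(x * y for x, y in zip(a, b))
    # accumulate root angles from pairwise dissimilarity
    thetas = [0.0]
    for prev, cur in zip(roots, roots[1:]):
        thetas.append(thetas[-1] + (1 - cos(prev, cur)))
    # squish/rescale so all roots fit within [0, pi)
    scale = min(1.0, math.pi / thetas[-1]) if thetas[-1] > 0 else 1.0
    layout = []
    for i, root in enumerate(roots):
        t = thetas[i] * scale
        layout.append((cos(query, root), t))
        for child in children.get(i, []):
            layout.append((cos(query, child), t + (1 - cos(root, child))))
    return layout
```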
OK, so it looks like you are using polar coordinates to visualize the various cosine similarities.
But what does this perspective provide?
I mean, I like it, it's cool, but the angle is just the cosine similarity.
So you expect small arcs for tightly correlated things, and bigger arcs for uncorrelated and spread out things.
But the rotation acts as a bias to the cosine similarity. What's good about this bias? Especially when it aliases back every 360 degrees
Aliasing kills everything, unless you use it cutely
hmm, it's possible that this view is not super intuitive.
the radius indicates cosine similarity to your query vector. this is always accurate.
the angle is more for second order neighbors. it tells you how far they are away from your root neighbor.
root neighbors are angularly spaced by cosine similarity between each other if there's enough space. if not, theta gets scaled down so they all fit.
It can't really; the max is +90 degrees for second-order neighbors, and their angular position is only relative to their parent.
It lets you know how your proximal points relate to you, and to a degree, to each other.
It's obviously a WIP, but it originated from me not really being able to get force-directed graphs working in any meaningful fashion, and other dimensionality reduction approaches (incl. SNE) not really producing any meaningfully interpretable results
I get it. It's certainly a compact non-linear way of visualizing cosine similarities!
And it won't truly alias because the radius is continually dropping off … you are correct.
The rotation of the spiral acts as a DC bias to cosine similarity.
This is actually good, IMO, for biasing the strength of one model over another.
What's weird though, and maybe what you are thinking, is if you treat these as complex numbers, and add them across different models?
Each model will have its own bias (rotation) and spread (arc length), relative to the other models. Wondering if this is the direction you were thinking.
However, what I do is note the spread and bias of each model, and add the cosine similarities, which is a coherent operation, and better than RRF or RSF since it synthesizes a giant embedding model, like the high dimensions of the original DaVinci embedding model, and contains more detail than any single model.
You could try to formalize a calculus on these arcs to synthesize some gain, but I find it more intuitive to combine them in the most direct and canonical way, and note the individual model characteristics, so I can understand the detailed DaVinci-sized monster I just created
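One way to read "note the spread and bias of each model, then add the cosine similarities" is a per-model z-score before summing; this is my interpretation, not the actual implementation:

```python
import statistics

def fuse(sims_per_model):
    """Coherent score fusion: remove each model's bias (mean
    similarity) and spread (std) before adding, so no single model's
    scale dominates the sum.
    sims_per_model: list of {doc: cosine_sim} dicts, one per model."""
    fused = {}
    for sims in sims_per_model:
        mu = statistics.mean(sims.values())
        sigma = statistics.pstdev(sims.values()) or 1.0
        for doc, s in sims.items():
            fused[doc] = fused.get(doc, 0.0) + (s - mu) / sigma
    return fused
```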
Rotation between the results, yeah. The closest vector is always pegged to 0°.
te3l vs mistral
(pulled out of the private channel)
this is te3l: (this still had linear radial scaling)
this is sfr-mistral:
I actually initially just wanted to test and visually debug instruct embeddings. Am I hitting a bullseye, is my prompt garbo, or am I missing data?
doing that, you're effectively normalizing each component by 1/dim.
I guess we could visually inspect whether you're actually gaining or erasing resolution. If (in the te3l vs mistral example) we called the difference between true and false the "confidence" band, I'd expect your addition approach to just give you the average of the two.
dumb idea: generate complement embedding gaussians for confidence amplification in model fusion
- take your query vector
- construct a complement q′ to your query q (e.g.: true => false, did I feed my dog? => did I starve my dog?)
- for each model m, element e:
  3.1. conf_m_e = (cosim_m(q, e) - cosim_m(q′, e))/2. note, sign is preserved
  3.2. mean_m_e = (cosim_m(q, e) + cosim_m(q′, e))/2
- for each element:
  4.1. conf_e = sqrt(sum_m(conf_m_e * |conf_m_e|)). variance, but preserve sign. if sqrt imaginary, multiply by i
  4.2. cosim_e = sum_m(mean_m_e / count_m) + conf_e
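A near-literal transcription of those steps in Python, with the "if sqrt imaginary, multiply by i" clause folded into a sign-preserving square root (my reading of the idea, not a tested implementation):

```python
import math

def fused_score(sims_q, sims_qc):
    """Complement-query confidence fusion for one element.
    sims_q : per-model cosine similarities of the element to query q
    sims_qc: per-model cosine similarities to the complement query q′
    Both are equal-length lists, one entry per model."""
    confs = [(a - b) / 2 for a, b in zip(sims_q, sims_qc)]
    means = [(a + b) / 2 for a, b in zip(sims_q, sims_qc)]
    # sign-preserving variance: sum conf * |conf|, then signed sqrt
    s = sum(c * abs(c) for c in confs)
    conf = math.copysign(math.sqrt(abs(s)), s)
    # average of per-model means, boosted (or pushed down) by conf
    return sum(means) / len(means) + conf
```

Agreeing models reinforce each other (positive confs stack), while disagreeing models cancel and leave only the mean.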
might need to work on some kinks, but the idea is that you're stacking gaussians. agreeing models will improve the score of an element, while disagreeing models will push the element away.
I forgot to ask: why are you embedding a formulaic pattern over and over? The pattern is "fed Whiskers on {Date}".
The embeddings are just discerning semantic differences in Date, which is an ill-posed problem for a semantics engine, right?
There is no one-size-fits-all solution, but why not try to use regex to extract everything with a date and replace the date string with {Date}?
This way you get the constant string "fed Whiskers on {Date}", and therefore a static embedding vector. Now you don't need to worry about these minor variations in semantics, since they have been flattened by your regex prefilter.
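A quick sketch of that prefilter (the date pattern here is illustrative, not exhaustive):

```python
import re

# matches ISO dates (2024-03-01) and month-name dates (March 1, 2024);
# extend the alternation for whatever formats actually show up
DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}|"
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"
    r" \d{1,2}(?:, \d{4})?)\b"
)

def flatten_dates(text):
    """Replace date strings with a {Date} placeholder so formulaic
    entries collapse to one canonical template before embedding."""
    return DATE_RE.sub("{Date}", text)

flatten_dates("fed Whiskers on 2024-03-01")     # -> "fed Whiskers on {Date}"
flatten_dates("fed Whiskers on March 1, 2024")  # -> "fed Whiskers on {Date}"
```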
Ah, those are just labels for the graph
The actual text is something like
"Well today's Monday, I hate Mondays but at least max is there to keep me company. I guess we still have some leftover steak and I'm sure that max would absolutely go crazy for it…"
I'll post the data when I get back to the office