Combining Different Latent Spaces?

Hi,

I found a post here about comparing cosine similarities over time.

I am doing something similar but a bit different. I have built an information retrieval engine over document embeddings in different languages. It reuses the same set of categories for every language and indexes each embedding by its cosine similarity to those categories, which gives a projected feature space of cosine similarities.
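
To make the setup concrete, here is a minimal sketch of that projection. It assumes each category is represented by a single anchor vector living in the same space as the documents; the function name `anchor_features` and the toy vectors are made up for illustration and are not the engine's actual code.

```python
import numpy as np

def anchor_features(docs: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """Return an (n_docs, n_anchors) matrix of cosine similarities.

    Hypothetical helper: each row of `docs` is a document embedding and
    each row of `anchors` is the embedding chosen to represent a category.
    """
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    anchors_unit = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return docs_unit @ anchors_unit.T

# Toy data (illustrative only): 3 documents and 2 category anchors in 4-D.
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0, 0.0],
                 [1.0, 1.0, 0.0, 0.0]])
anchors = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])

# One row per document, one column per category.
features = anchor_features(docs, anchors)
print(features)
```

Each document ends up described by one number per category, regardless of how many dimensions the original embedding had.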

There’s a faceted browser that lets users combine features in a context-sensitive way and browse the pooled cosine similarities.

My question is: does the community think this could have any use for aligning document embeddings in different languages?

That is, suppose I reused the same features, indexed a large number of vector databases with anchor vectors representing those features, and kept pooling the results until I had one very large cosine-similarity feature space. Would that be useful for combining documents, or would it just contain meaningless noise and strange artefacts? Could I use it to align different embedding spaces over time? Would it be useful for training a model or for RAG?
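
A toy sketch of the idea, under one strong assumption: the second embedding space is an exact rotation of the first. In that best case the anchor-based features from the two spaces match exactly, because cosine similarity is unchanged by an orthogonal transform. Real embedding models for different languages are not exact rotations of each other, so in practice the match would only be approximate, if it holds at all. All names and data here are synthetic and hypothetical.

```python
import numpy as np

def anchor_features(vectors: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """Cosine similarity of every vector to every anchor (hypothetical helper)."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return v @ a.T

rng = np.random.default_rng(0)
dim, n_docs, n_anchors = 8, 5, 3

# "Space A": synthetic document and anchor embeddings.
docs_a = rng.normal(size=(n_docs, dim))
anchors_a = rng.normal(size=(n_anchors, dim))

# "Space B": the same content seen through a random orthogonal matrix,
# built here from a QR decomposition. This is the idealised assumption.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
docs_b = docs_a @ rotation
anchors_b = anchors_a @ rotation

feats_a = anchor_features(docs_a, anchors_a)
feats_b = anchor_features(docs_b, anchors_b)

# The raw coordinates disagree, but the anchor-relative features agree.
print("raw embeddings equal:   ", np.allclose(docs_a, docs_b))
print("anchor features equal:  ", np.allclose(feats_a, feats_b))

# Pooling is then just stacking rows in the shared feature space.
pooled = np.vstack([feats_a, feats_b])
print("pooled shape:", pooled.shape)
```

How far real spaces drift from this ideal is the open part of the question: the further they are from a rotation of one another, the more the pooled features would reflect differences between models rather than between documents.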

Or is this approach just extremely naive?

Hey there and welcome!

Sounds like a neat project.

Just to be clear, do you mean different natural languages that the docs are written in, or different programming languages?

Perhaps yes, but might this be over-complicating something that a simple graph database could solve? You can essentially build your own pool of vectorized data with whatever arbitrary connections, clusters, or links between nodes you’d like. Others here and I really like Neo4j because of the cool ways you can cluster and vectorize things for easy RAG with an LLM.