RAGxplorer: Visualising Document Chunks in the Embedding Space

cyzgab · January 20, 2024, 4:20pm

Hello!

I built a simple Streamlit app to visualise the embeddings of document chunks vis-à-vis the RAG query!

text-embedding-ada-002 is one of 2 models used here.

Hope it’s helpful!

Diet · January 21, 2024, 4:25am

Very cool! What’s your projection method?

cyzgab · January 21, 2024, 2:51pm

I’m using UMAP.

merefield · January 21, 2024, 4:18pm

But this is just 2 dimensional? What about the other 1,534 dimensions?

Why wouldn’t this end up potentially being incredibly misleading?

cyzgab · January 22, 2024, 3:26am

You’re right that some information will be lost with the dimensionality reduction.

That said, the retrieval is done in the full embedding space using a distance metric like cosine similarity.

It’s only for inspection/visualization purposes that it’s projected onto 2D.

Notice in the diagram that the points nearest to the query in the full embedding space are not necessarily the ones nearest to it in the 2D projection.
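To make the full-space versus 2D distinction concrete, here is a minimal sketch. It is not RAGxplorer's code: the embeddings are random synthetic vectors, `cosine_top_k` and `pca_2d` are hypothetical helper names, and PCA (via NumPy's SVD) is swapped in for UMAP only so the example needs nothing beyond NumPy.

```python
import numpy as np

def cosine_top_k(query, chunks, k):
    """Return indices of the k chunks most cosine-similar to the query, best first."""
    q = query / np.linalg.norm(query)
    c = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k], sims

def pca_2d(points):
    """Project rows onto their first two principal components (stand-in for UMAP)."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
chunks = rng.normal(size=(200, 1536))  # synthetic stand-ins for chunk embeddings
query = rng.normal(size=1536)          # synthetic stand-in for the query embedding

# Retrieval happens in the full 1,536-dimensional space.
retrieved, sims = cosine_top_k(query, chunks, k=5)

# Project chunks and query together so they share one 2D layout.
coords = pca_2d(np.vstack([chunks, query]))
chunk_xy, query_xy = coords[:-1], coords[-1]
nearest_2d = np.argsort(np.linalg.norm(chunk_xy - query_xy, axis=1))[:5]

print("retrieved in full space:", sorted(retrieved.tolist()))
print("nearest in the 2D plot: ", sorted(nearest_2d.tolist()))
print("overlap:", len(set(retrieved.tolist()) & set(nearest_2d.tolist())))
```

With real embeddings and UMAP the overlap will differ, but the two lists are computed independently in the same way.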

The idea of this tool is to see what other points might be nearby but were not retrieved because they were marginally missed in the full embedding space. One can also experiment with how changing the chunking, the embedding model, or even the retrieval strategy (i.e. going beyond just the top K chunks) affects the specific chunks chosen.

You can also inspect the documents themselves and, if they are clearly unrelated to the query, that may suggest changing some of the steps in the RAG pipeline.
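As a rough illustration of "marginally missed", the sketch below lists chunks that fall just outside the top K. The helper `near_misses`, the `margin` value, and the similarity scores are all made up for the example; none of this is part of RAGxplorer.

```python
import numpy as np

def near_misses(sims, k, margin=0.02):
    """Indices of chunks outside the top k whose similarity is within
    `margin` of the k-th best score, best first."""
    order = np.argsort(-sims)      # chunk indices, most similar first
    cutoff = sims[order[k - 1]]    # score of the last retrieved chunk
    rest = order[k:]               # everything that was not retrieved
    return rest[sims[rest] >= cutoff - margin]

# Made-up cosine similarities between one query and six chunks.
sims = np.array([0.91, 0.62, 0.88, 0.87, 0.40, 0.865])
print(near_misses(sims, k=2))  # [3 5]
```

Here chunks 0 and 2 are retrieved, while chunks 3 and 5 score within 0.02 of the cutoff and are the ones worth inspecting.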

merefield · January 22, 2024, 8:49am

Yup, so you are comparing distance only on two dimensions, makes sense.

That is quite a lot of information lost though?

For example, clustering could be misleading? It will just indicate things which are “similarly different” but not necessarily “different in the same way”?

But appreciate given the limitations of the human mind and presentation formats, this is nearly as close as we can get to visualisation?

A 3D option wouldn’t make much of an inroad?
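The "similarly different" concern can be shown with a toy case. This is only a sketch with hand-picked 3D vectors and a hand-picked linear projection (not UMAP): two chunks are equally similar to the query yet differ from it along different axes, and the projection places them on the same 2D point.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0, 0.0])
a = np.array([1.0, 1.0, 0.0])  # differs from the query along axis 1
b = np.array([1.0, 0.0, 1.0])  # differs from the query along axis 2

print(cosine(query, a), cosine(query, b))  # same similarity to the query
print(cosine(a, b))                        # yet a and b are less similar to each other

# A hand-picked projection to 2D that keeps axis 0 and merges axes 1 and 2.
s = 1.0 / np.sqrt(2.0)
projection = np.array([[1.0, 0.0, 0.0],
                       [0.0, s, s]])
a_2d, b_2d = projection @ a, projection @ b
print(np.allclose(a_2d, b_2d))  # True: the two chunks overlap in the plot
```

A non-linear method such as UMAP behaves differently from this linear projection, but any map from many dimensions to two can merge directions in a similar way.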

cyzgab · January 24, 2024, 6:08am

When comparing it visually, yes. But the actual retrieval (i.e. calculating cosine similarity) is done in the full-dimensional space.

I’ve personally found it hard to use 3D visualisations on a 2D screen. Maybe we’ll have something really 3D with the Apple Vision Pro.

cyzgab · January 24, 2024, 6:10am

Made some changes to the repo in the experiment branch.

The idea’s to strip out the code from the Streamlit app and make it into a package.

This is my first time doing this, so any advice would be most appreciated here.

Here’s a code example of the current API:

Installation

git clone -b experiment https://github.com/gabrielchua/RAGxplorer.git
cd RAGxplorer
virtualenv venv # create a new virtual env
source venv/bin/activate # activate the virtual env
pip install -r requirements.txt

Usage

from ragxplorer.ragxplorer import Explorer
client = Explorer(embedding_model="text-embedding-ada-002") # Please ensure "OPENAI_API_KEY" is set as an env variable
client.load_document("presentation.pdf")
client.visualise_query("What are the top revenue drivers for Microsoft?")

cyzgab · January 27, 2024, 4:41pm

It’s now a package on PyPI.

The latest OpenAI embedding models are supported too!

pip install ragxplorer
