RAGxplorer: Visualising Document Chunks in the Embedding Space

Hello!

I built a simple Streamlit app to visualise the embeddings of document chunks vis-a-vis the RAG query!

text-embedding-ada-002 is one of 2 models used here.

Hope it’s helpful!

Very cool! What’s your projection method?

I’m using UMAP :slight_smile:

But this is just 2 dimensional? What about the other 1,534 dimensions?

Why wouldn’t this end up potentially being incredibly misleading?

You’re right that some information will be lost with the dimensionality reduction.

That said, the retrieval is done in the full embedding space using a distance metric like cosine similarity.
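To make the retrieval step concrete, here is a minimal sketch of top-K retrieval by cosine similarity in the full embedding space. It is not RAGxplorer's actual code: the function name `top_k_cosine` is made up for illustration, and random vectors stand in for real 1,536-dimensional chunk embeddings.

```python
import numpy as np

def top_k_cosine(query_vec, chunk_vecs, k=5):
    """Return indices and scores of the k chunks most similar to the query."""
    # Normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)[:k]  # highest similarity first
    return order, sims[order]

# Hypothetical data: random vectors instead of real embeddings.
rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(100, 1536))
query_vec = chunk_vecs[42] + 0.1 * rng.normal(size=1536)  # query placed near chunk 42

idx, scores = top_k_cosine(query_vec, chunk_vecs, k=3)
print(idx, scores)
```

Nothing here touches the 2D coordinates, which is the point: the ranking uses all dimensions.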

It’s only for inspection/visualization purposes that it’s projected onto 2D.

If you look at the diagram, you’ll notice that the points nearest to the query in the full embedding space are not the ones nearest to it in the 2D space.
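A rough way to check this numerically is to count how many of the top-K neighbours in the full space are also among the K nearest points after projection. The sketch below is an assumption-heavy illustration: it swaps in PCA (via NumPy's SVD) for UMAP to stay dependency-free, uses random vectors instead of real embeddings, and the helper names `project_2d` and `nearest` are hypothetical.

```python
import numpy as np

def project_2d(x):
    """Project rows of x onto their first two principal components (PCA via SVD)."""
    centred = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T

def nearest(points, query, k):
    """Indices of the k points closest to query by Euclidean distance."""
    dist = np.linalg.norm(points - query, axis=1)
    return np.argsort(dist)[:k]

# Hypothetical data: rows 0..198 are chunks, the last row is the query.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 1536))
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
coords = project_2d(unit)

k = 10
# Full space: rank chunks by cosine similarity to the query.
full_nn = set(np.argsort(-(unit[:-1] @ unit[-1]))[:k].tolist())
# 2D space: rank chunks by Euclidean distance to the projected query.
flat_nn = set(nearest(coords[:-1], coords[-1], k).tolist())

overlap = len(full_nn & flat_nn)
print(f"{overlap} of the top {k} chunks are also among the {k} nearest in 2D")
```

Random vectors have no cluster structure, so they are close to a worst case; real embeddings projected with UMAP would likely overlap more, but rarely perfectly.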

The idea of this tool is to see what other points might be nearby but are not retrieved because they were marginally missed in the full embedding space. One can also experiment with how changing the chunking, the embedding model, or even the retrieval strategy (i.e. going beyond just the top K chunks) affects which chunks are chosen.
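As a small illustration of "marginally missed", the sketch below lists chunks that fall just outside the top K but score within a small margin of the K-th result. The function `near_misses`, the `margin` value and the similarity scores are all made up for the example; this is not part of the RAGxplorer API.

```python
import numpy as np

def near_misses(sims, k, margin=0.02):
    """Indices outside the top k whose similarity is within `margin` of the k-th score."""
    order = np.argsort(-sims)       # best match first
    cutoff = sims[order[k - 1]]     # score of the last retrieved chunk
    return [int(i) for i in order[k:] if cutoff - sims[i] <= margin]

# Hypothetical cosine similarities between one query and six chunks.
sims = np.array([0.91, 0.72, 0.88, 0.87, 0.60, 0.865])
missed = near_misses(sims, k=2, margin=0.02)
print(missed)  # chunks 3 and 5 score just below the cutoff of 0.88
```

If a list like this is long, the top K cutoff is somewhat arbitrary for that query, and a larger K or a different retrieval strategy may be worth trying.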

You can also inspect the documents themselves. If they are clearly unrelated to the query, that may suggest changing one or more steps in the RAG pipeline.

Yup, so you are comparing distance only on two dimensions, makes sense.

That is quite a lot of information lost though?

For example, clustering could be misleading? It will just indicate things which are “similarly different” but not necessarily “different in the same way”?

But I appreciate that, given the limitations of the human mind and presentation formats, this is about as close as we can get to a visualisation?

A 3D option wouldn’t make much of an inroad?

When comparing it visually, yes. But the actual retrieval (i.e. calculating cosine similarity) is done in the full-dimensional space.

I’ve personally found it hard to use 3D visualisations on a 2D screen. Maybe we’ll have something really 3D with the Apple Vision Pro :joy:

Made some changes to the repo in the experiment branch.

The idea’s to strip the code out of the Streamlit app and turn it into a package.

This is my first time doing this, so any advice would be most appreciated here.

Here’s a code example of the current API:

Installation

git clone -b experiment https://github.com/gabrielchua/RAGxplorer.git
cd RAGxplorer
virtualenv venv # create a new virtual env
source venv/bin/activate # activate the virtual env
pip install -r requirements.txt

Usage

from ragxplorer.ragxplorer import Explorer
client = Explorer(embedding_model="text-embedding-ada-002") # Please ensure "OPENAI_API_KEY" is set as an env variable
client.load_document("presentation.pdf")
client.visualise_query("What are the top revenue drivers for Microsoft?")

It’s now a package on PyPI :slight_smile:

The latest OpenAI embedding models are supported too!

pip install ragxplorer