Visualizing chatgpt embeddings using UMAP

Mayank11 · October 5, 2023, 5:34pm

I have embedded about 20k short texts using text-embedding-ada-002 and am trying to visualize the embeddings in 2D using UMAP. However, the results were not what I expected. I tried different values for the n_neighbors and min_dist parameters, with ‘cosine’ as the metric. I suspect min_dist is not being applied properly by UMAP, as I still see a lot of overlapping samples in the 2D projection. Is there a recommended min_dist / n_neighbors value for visualizing chatgpt embeddings with UMAP?

Any help is appreciated.

emb3d.co · October 7, 2023, 5:59am

This may not be a problem with your UMAP hyperparameters but a visualization issue. This article lists some of the plotting pitfalls with large data. At 20K records, it is very hard to see meaningful patterns because of issues like overplotting, where many points are drawn on top of each other. Some form of aggregation may be needed even in 2D. Have you tried clustering your data?
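To illustrate the aggregation idea, here is a minimal sketch that bins the 2D points into a grid and counts points per cell instead of drawing every point. The `bin_points` helper and the `projection` array are made up for this example; `projection` is random data standing in for the (n_samples, 2) output you would get from UMAP.

```python
import numpy as np

def bin_points(xy, bins=50):
    """Count points per cell of a bins x bins grid over a 2D projection."""
    xy = np.asarray(xy, dtype=float)
    counts, x_edges, y_edges = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins)
    return counts, x_edges, y_edges

# Assumption: random stand-in for the (n_samples, 2) array returned by UMAP.
rng = np.random.default_rng(0)
projection = rng.normal(size=(20_000, 2))

counts, x_edges, y_edges = bin_points(projection, bins=50)

# A log scale keeps sparse regions visible next to very dense ones.
log_density = np.log1p(counts)
print(counts.shape, int(counts.sum()))
```

The `log_density` grid can then be drawn as a heatmap, for example with matplotlib's `imshow(log_density.T, origin="lower")`. matplotlib's `hexbin(x, y, gridsize=50, bins="log")` does a similar aggregation in one call.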

If you already have domain knowledge of the short texts, you could try to see if the clusters match your expectation.
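One rough way to check clusters against expectations is a cluster-by-label table plus a purity score (the fraction of points that carry the majority label of their cluster). This is only a sketch: `cluster_purity` is a hypothetical helper, the cluster ids are assumed to come from whatever clustering algorithm you run, and the topic labels below are invented toy data.

```python
import numpy as np

def cluster_purity(cluster_ids, labels):
    """Return a (clusters x labels) count table and the overall purity."""
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    # Map arbitrary ids and label names to 0..k-1 indices (sorted order).
    _, c_idx = np.unique(cluster_ids, return_inverse=True)
    _, l_idx = np.unique(labels, return_inverse=True)
    table = np.zeros((c_idx.max() + 1, l_idx.max() + 1), dtype=int)
    np.add.at(table, (c_idx, l_idx), 1)
    # Purity: share of points that match their cluster's majority label.
    purity = table.max(axis=1).sum() / len(labels)
    return table, purity

# Toy data (assumption): 8 texts, 3 clusters, 3 expected topics.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
topics = ["billing", "billing", "login", "login", "login",
          "shipping", "shipping", "shipping"]

table, purity = cluster_purity(clusters, topics)
print(table)
print(purity)  # 0.875
```

A low purity would suggest the clusters do not line up with the groups you expected. Purity tends to rise as the number of clusters grows, so it is best compared across runs with a similar cluster count.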
