Visualizing ChatGPT embeddings using UMAP

I have embedded about 20k short texts using text-embedding-ada-002 and I am trying to visualize the embeddings in 2D using UMAP. However, the results are not what I was expecting. I tried different values for the n_neighbors and min_dist parameters, and ‘cosine’ for the metric parameter. I suspect min_dist is not being applied properly by UMAP, as I still see a lot of overlapping samples in the low-dimensional projection. Is there a recommended min_dist / n_neighbors value for visualizing ChatGPT embeddings properly with UMAP?

Any help is appreciated.

This may not be a problem with your UMAP hyperparameters but a visualization issue. This article lists some of the plotting pitfalls with large data. At 20K records, overplotting alone can make meaningful patterns very hard to see, so some form of aggregation may be needed even in 2D. Have you tried clustering your data?
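To make the aggregation idea concrete, here is a minimal sketch that bins a 2D projection into a regular grid and counts points per cell, so density is shown instead of 20K overlapping markers. The random array is only a stand-in for whatever your UMAP fit_transform call returns, and bin_embedding and the bin count of 50 are names and values I made up for illustration.

```python
import numpy as np

def bin_embedding(points, bins=50):
    """Count how many points fall in each cell of a regular 2D grid."""
    points = np.asarray(points, dtype=float)
    counts, x_edges, y_edges = np.histogram2d(
        points[:, 0], points[:, 1], bins=bins
    )
    return counts, x_edges, y_edges

# Stand-in data: replace with the (n_samples, 2) array from your UMAP projection.
rng = np.random.default_rng(0)
embedding_2d = rng.normal(size=(20_000, 2))

counts, x_edges, y_edges = bin_embedding(embedding_2d, bins=50)

# Every point lands in exactly one cell, so the counts sum to n_samples.
print(counts.shape, int(counts.sum()))
```

The counts grid can then be drawn as a heatmap (a log scale such as np.log1p(counts) usually helps when a few cells dominate). If you plot with matplotlib, hexbin does a similar aggregation directly from the x and y columns.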

If you already have domain knowledge of the short texts, you could check whether the clusters match your expectations.
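One rough way to do that check, assuming you have a cluster id per text (from whatever clustering algorithm you choose) and a known category for at least some of the texts, is to cross-tabulate the two and see how pure each cluster is. The category names and the helper functions below are hypothetical, just to show the shape of the comparison.

```python
import numpy as np

def contingency_table(cluster_ids, known_labels):
    """Rows are clusters, columns are known categories, cells are counts."""
    cluster_ids = np.asarray(cluster_ids)
    known_labels = np.asarray(known_labels)
    clusters, cluster_idx = np.unique(cluster_ids, return_inverse=True)
    labels, label_idx = np.unique(known_labels, return_inverse=True)
    table = np.zeros((clusters.size, labels.size), dtype=int)
    # Unbuffered add so repeated (cluster, label) pairs are all counted.
    np.add.at(table, (cluster_idx, label_idx), 1)
    return clusters, labels, table

def purity(table):
    """Fraction of points that belong to the majority category of their cluster."""
    return table.max(axis=1).sum() / table.sum()

# Toy example with made-up categories.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
known = ["billing", "billing", "shipping", "shipping",
         "shipping", "returns", "returns", "returns"]

clusters, labels, table = contingency_table(cluster_ids, known)
print(labels)
print(table)
print(purity(table))
```

A purity close to 1 means the clusters largely agree with the categories you expected; rows of the table that are spread across many columns point to clusters worth inspecting by hand.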
