Embedding distribution

Hi Guys
I am still trying to get my head around embeddings.
In the code below I created embeddings for a list of (test) strings, some of which are identical. But the resulting plot shows a distribution of the embeddings that I don't quite understand - I would have at least expected a cluster of points for the identical strings.
What am I missing?


import openai
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

openai.api_key = "■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■fXJq5Aph"


def get_embeddings(strings):
    return_list = []
    for string in strings:
        response = openai.Embedding.create(
            model="text-embedding-ada-002",  # the model name was missing in the original snippet
            input=string,
        )
        return_list.append(response["data"][0]["embedding"])
    return return_list


# `strings` is the list of test strings (not shown in the post)
embeddings_list = get_embeddings(strings)

# reduce the high-dimensional embeddings to 3 dimensions
tsne = TSNE(n_components=3)
reduced_embeddings = tsne.fit_transform(embeddings_list)

# create a figure and 3D axis
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# plot each reduced-dimensional embedding as a point
for embedding in reduced_embeddings:
    ax.scatter(embedding[0], embedding[1], embedding[2])

# show the plot
plt.show()

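One way to sanity-check this before plotting at all is to compare the raw vectors directly - identical input strings should come back as identical (or near-identical) embeddings. A minimal sketch with made-up stand-in vectors (not real API output):

```python
import numpy as np

# Hypothetical embeddings standing in for get_embeddings() output.
# Identical input strings should produce matching vectors.
emb_a = np.array([0.12, -0.53, 0.98, 0.07])
emb_b = np.array([0.12, -0.53, 0.98, 0.07])  # same string -> same vector
emb_c = np.array([0.90, 0.11, -0.44, 0.63])  # a different string

print(np.allclose(emb_a, emb_b))  # True: the duplicates match
print(np.allclose(emb_a, emb_c))  # False: distinct strings differ
```

If the duplicates match here but not in the plot, the issue is in the dimensionality reduction step rather than in the embeddings themselves.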

I suspect if you zoomed in on that graph so that the scale was more detailed, you would indeed find that identical words cluster close together. Also, there may always be a small difference in the embeddings for identical words. I am not sure why, I just believe it to be true based on things I’ve read. Perhaps a more technical person can explain why.
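To put a number on how "small" that difference is, cosine similarity is the usual measure for embeddings - 1.0 means identical direction. A quick sketch with hypothetical vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    # 1.0 = identical direction; near-duplicate embeddings should be very close to 1.0
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors standing in for two embeddings of the same word
u = np.array([0.10, 0.20, 0.30])
v = np.array([0.10, 0.20, 0.30001])  # tiny numerical difference

print(cosine_similarity(u, v))  # very close to 1.0
```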


Scaling is an interesting thought.

Hm, not sure if it's scaling.
I created the embeddings for these strings now


and the resulting 2D plot is
I tested with some more and it somehow seems the points are just evenly distributed over the space. Any idea what I'm doing wrong?
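One thing worth checking: t-SNE's default perplexity (30) assumes more samples than a short test list provides, and recent scikit-learn versions even raise an error when perplexity >= n_samples. A sketch with synthetic stand-in vectors (not real embeddings) showing a setting that works for a small, duplicate-heavy set:

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-ins for embeddings: 3 distinct vectors, each repeated 5 times
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 16))
data = np.repeat(base, 5, axis=0)  # 15 points with exact duplicates

# perplexity must be smaller than the number of samples; the default (30)
# is too large for a 15-point test set
tsne = TSNE(n_components=2, perplexity=5, init="pca", random_state=0)
reduced = tsne.fit_transform(data)
print(reduced.shape)  # (15, 2)
```

With a sensible perplexity the duplicated rows should land very close together in the 2D plot; with only a handful of points and an unsuitable perplexity, t-SNE tends to just spread everything out.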