Embedding distribution

heiko · December 12, 2022, 6:24pm

Hi Guys
I am still trying to get my head around embeddings.
In the code below I created embeddings for a list of (test) strings, some of which are identical. But the resulting plot shows a distribution of the embeddings that I don't quite understand. I would have at least expected a cluster of points for the identical strings.
What am I missing?

plot

import os

import matplotlib.pyplot as plt
import openai
from sklearn.manifold import TSNE

# read the key from the environment rather than hard-coding it in the script
openai.api_key = os.environ["OPENAI_API_KEY"]

input_strings=[
  "king",
  "queen",
  "castle",
  "castle",
  "castle",
  "castle",
  "castle",
  "castle",
  "castle",
  "rocket",
  "moon",
  "accountant",
  "finance"
]


def get_embeddings(strings):
    """Return one embedding vector per input string."""
    return_list = []
    for string in strings:
        # one request per string; the first item of 'data' holds its embedding
        response = openai.Embedding.create(
            model="text-search-ada-doc-001",
            input=string
        )
        return_list.append(response['data'][0]['embedding'])
    return return_list


embeddings_list=get_embeddings(input_strings)

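# perplexity must be smaller than the number of samples (13 here),
# so the default of 30 is lowered
tsne = TSNE(n_components=3, perplexity=5)
reduced_embeddings = tsne.fit_transform(embeddings_list)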

# create a 3D figure and axis
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# loop through the list of reduced-dimensional embeddings
for point in reduced_embeddings:
    # plot the 3D point on the axis
    ax.scatter(point[0], point[1], point[2])

# show the plot
plt.show()

lmccallum · December 12, 2022, 9:52pm

I suspect if you zoomed in on that graph so that the scale was more detailed, you would indeed find that identical words cluster close together. Also, there may always be a small difference in the embeddings for identical words. I am not sure why, I just believe it to be true based on things I’ve read. Perhaps a more technical person can explain why.
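
A direct way to test this claim, without any plotting, is to compare the raw vectors. The sketch below is not from the thread: it uses made-up stand-in vectors (a real check would pass in the lists returned by get_embeddings) and a hypothetical helper, cosine_similarity_matrix. It shows that identical vectors score 1.0, and vectors that differ only by tiny noise score just below it.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between the rows."""
    x = np.asarray(vectors, dtype=float)
    # normalise each row to unit length, then take all pairwise dot products
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


# stand-in vectors (assumption): in practice these would be API embeddings
rng = np.random.default_rng(0)
castle = rng.normal(size=256)
king = rng.normal(size=256)
# a copy of "castle" with tiny noise, mimicking small run-to-run differences
castle_again = castle + rng.normal(scale=1e-4, size=256)

sims = cosine_similarity_matrix([castle, castle_again, king])
print(np.round(sims, 4))
```

If the similarity between two embeddings of the same string comes out at or very near 1.0, the embeddings themselves agree, and any spread in the plot would have to come from the dimensionality reduction step.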

heiko · December 12, 2022, 9:56pm

Scaling is an interesting thought.

heiko · December 13, 2022, 6:02pm

Hm, not sure if it's scaling.
I have now created the embeddings for these strings:

  "castle",
  "king",
  "king"

and the resulting 2D plot is below.
I tested with some more strings, and it somehow seems the points are just evenly distributed over the space. Any idea what I'm doing wrong?
2dplot
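
One way to tell whether the even spread comes from the embeddings or from the projection is to compare against a linear projection. The sketch below is an assumption, not something tried in this thread: it swaps t-SNE for PCA (computed with NumPy's SVD in a hypothetical helper, pca_project) and uses made-up stand-in vectors rather than real API output. Because PCA applies the same linear map to every row, identical inputs are guaranteed to land on the same 2D point.

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project the rows onto their top principal components (plain PCA)."""
    x = np.asarray(vectors, dtype=float)
    centered = x - x.mean(axis=0)
    # rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# stand-in vectors (assumption): in practice these would be API embeddings
rng = np.random.default_rng(0)
castle = rng.normal(size=256)
king = rng.normal(size=256)

# same layout as the test above: one "castle" and two identical "king" rows
points = pca_project([castle, king, king])
print(points)
```

If the two "king" rows coincide here but drift apart under t-SNE, that would point at the t-SNE settings (for example the perplexity relative to the number of points) rather than at the embeddings.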
