Introducing Embeddings

Hi all!

We’re rolling out Embeddings to all API users as part of a public beta.
Our Embeddings offering combines a new endpoint and set of models to address more advanced search, clustering, and classification tasks. The /embeddings endpoint returns a vector representation of the given input that can be easily consumed by machine learning models and algorithms.

We are releasing three sets of models:

  • Text Similarity: excels in capturing semantic similarity between pairs of text
  • Text Search: excels in finding relevant documents for a query among a collection of documents
  • Code Search: excels in finding relevant code blocks for a natural language query
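For a rough feel of the Text Search use case, here is a toy sketch of ranking documents by cosine similarity to a query vector. The vectors below are hand-made stand-ins (real embeddings from the endpoint have hundreds or thousands of dimensions); only the ranking logic is the point:

```python
import math

# Hand-made stand-in vectors; a real application would fetch these
# from the /embeddings endpoint instead.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.2, 0.8, 0.1],
    "doc_c": [0.85, 0.15, 0.05],
}
query = [0.88, 0.12, 0.02]

def cosine(a, b):
    # Cosine similarity: dot product over the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # doc_a comes out on top
```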

Please read our in-depth guide for examples on how to apply embeddings to different use cases, and refer to the API reference for more details on how to query the endpoint.

The /embeddings endpoint is offered for free through the end of 2021. If you have any questions, issues, or general feedback, or if you would like to use embeddings for academic or research purposes, please contact embeddings@openai.com.

38 Likes

Has the embeddings endpoint been pulled down? I started using it yesterday, and today, there is no hint of documentation or guide anywhere to be found on the site.

Usage stats from yesterday are also gone. The code I wrote still works; is it likely to stop working soon?

1 Like

Well that ain’t great. I don’t believe we intended this to happen.

Thanks for the heads up! We’re looking into it.

The embedding endpoints should still be functioning, so if you have running code, it should still work.

2 Likes

I’ll be the first to say it: Text Similarity is my favorite part about GPT-3 <3

7 Likes

And we’re back! Sorry about that.

3 Likes

Confirmed. Thanks a lot!

That’s an amazing TAT. Kudos.

This is excellent. I switched the search in a project I’m working on over to these embeddings, and it’s so much better, even with just Ada. I was using GloVe before.

One feature request would be access to these embedding layers for fine-tuned models too, if that’s even possible.

2 Likes

We’re hoping to support embeddings for fine-tuned models at some point in the future, but we don’t have an expected release date.

2 Likes

This is awesome. What’s the meaning of the different indices?
i.e. response["data"][0] vs response["data"][1]
Maybe they indicate which tokens’ embeddings are used? i.e. 0 being the last, 1 the one before last…
(couldn’t find it in the guide)

Close! We only return one embedding per input, considered as a whole. The length of the embedding won’t change based on the length of the input passed in.

There should be one entry in the “data” array for each input you’ve submitted in the request. To figure out which embedding maps to which input, look at the “index” field for that entry.

You can see the index field in the response object near this doc: OpenAI API
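If it helps, here is a toy illustration of that mapping. The response dict below is hand-made with hypothetical values, shaped like the endpoint’s output, to show why you should pair inputs with embeddings via the “index” field rather than relying on the order of the “data” array:

```python
inputs = ["first document", "second document"]

# Hypothetical response shaped like the /embeddings output; the entries
# are deliberately out of order here to make the point.
response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
    ],
}

# Use the "index" field to map each embedding back to its input.
by_input = {inputs[entry["index"]]: entry["embedding"] for entry in response["data"]}
print(by_input["first document"])  # [0.1, 0.2, 0.3]
```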

Oh I see - Is sending tokens already supported?
Sending an array of token arrays as stated here OpenAI API
yields:

response = openai.Engine(id="babbage-similarity").embeddings(
    input=[['Sample', 'Ġdocument', 'Ġtext', 'Ġgoes', 'Ġhere']],
)
embeddings = response['data'][0]['embedding']

InvalidRequestError: [['Sample', 'Ġdocument', 'Ġtext', 'Ġgoes', 'Ġhere']] is not valid under any of the given schemas - 'input'

(using openai==0.11.3)

We support arrays of token arrays. Tokens are ints. In this case you’ve submitted an array of array of strings. For strings, we support a single string and an array of strings. The following might be what you want:

response = openai.Engine(id="babbage-similarity").embeddings(
    input=['Sample', 'Ġdocument', 'Ġtext', 'Ġgoes', 'Ġhere'],
)

Otherwise you’ll need to tokenize your input.

Great, thanks - I was looking for the following 🙂

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
response = openai.Engine(id="babbage-similarity").embeddings(
    input=[tokenizer.encode("Sample document goes here", add_special_tokens=False)],
)
embeddings = response['data'][0]['embedding']

I figured it’s the stochastic nature of the sampling.

This is cool! I appreciate the Use Cases provided… a seed has been planted in the field of possibility.

1 Like

What does “embedding” mean in the OpenAI context? Why was this name chosen? What is the difference between an “embedding” and a “non-embedding”, and what would the opposite of an “embedding” be? I don’t understand what I should picture under this term. Is it just a property/method of the model, or is it something like a reward/food for the model?

Embedding is a term of art in the AI/ML field.
In simple terms, an embedding is a multi-dimensional numeric representation of a concept that contains a distillation of all its semantics (properties).

For example, the redness, roundness, and sweetness of an apple are 3 properties that can be numerically expressed and together constitute a 3-dimensional embedding of an apple. Think of it as a point (vector) in 3D space. The sphere around it contains the things that are close to an apple.

Just expand this to many more dimensions, consider that the individual dimensions are no longer human-interpretable (unlike sweetness, redness, etc. above), and you have your 1024/2048/4096-dimensional embeddings.
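The apple picture above fits in a few lines of code; the values are invented and only the geometry matters:

```python
import math

# Three hand-picked, made-up dimensions: (redness, roundness, sweetness).
apple = (0.9, 0.8, 0.7)
cherry = (0.85, 0.9, 0.75)  # similar properties -> a nearby point
lemon = (0.1, 0.6, 0.05)    # different properties -> a distant point

def distance(a, b):
    # Euclidean distance between two points in this 3D "property space".
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(apple, cherry) < distance(apple, lemon))  # True
```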

HTH.

7 Likes

Excellent explanation.

The idea of “concept” being a synonym of “embedding” is genius, but only in this context. In general, it may not be so. As an example off the top of my head, in a collaborative filtering context for recommendations, the users’ embeddings may be synonymous with preferences. In a simple tf-idf model, the discovered embeddings are (roughly) the normalized relative frequencies of terms.

However, given that the question was about the OpenAI GPT3 embeddings endpoint, I agree with your interpretation.
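As a concrete version of the tf-idf remark, here is a from-scratch sketch (illustrative, not a library-accurate implementation) of how each document’s weighted term frequencies form a crude embedding:

```python
import math

# A tiny corpus of tokenized documents.
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
vocab = sorted({w for d in docs for w in d})

def tfidf_vector(doc):
    # One tf-idf weight per vocabulary term; the resulting vector is
    # the document's crude "embedding".
    vec = []
    for term in vocab:
        tf = doc.count(term) / len(doc)               # term frequency
        df = sum(term in d for d in docs)             # document frequency
        idf = math.log(len(docs) / df)                # inverse document frequency
        vec.append(tf * idf)
    return vec

# "the" appears in every document, so its weight is zero in every vector.
print(tfidf_vector(docs[0]))
```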

2 Likes

Hi
when I tried this snippet:

def get_embedding(text, engine="davinci-similarity"):
    text = text.replace("\n", " ")
    return openai.Engine(id=engine).embeddings(input=[text])['data'][0]['embedding']

get_embedding('Its a car')

I got an error saying:

/usr/local/lib/python3.7/dist-packages/openai/api_requestor.py in _interpret_response_line(self, rbody, rcode, rheaders, stream)
317 if stream_error or not 200 <= rcode < 300:
318 raise self.handle_error_response(
→ 319 rbody, rcode, resp.data, rheaders, stream_error=stream_error
320 )
321 return resp

InvalidRequestError: Engine not found

Why is this happening? The same happens for babbage-similarity, curie-similarity, etc.
But I’m getting a response from this snippet:
response = openai.Completion.create(engine="davinci", prompt="This is a test", max_tokens=5)
so it seems there’s some issue with getting the embeddings.
Thanks in advance

1 Like

Welcome to the OpenAI community @rramachandra93!!

Interesting, the code looks correct. A little more information is needed here.

What operating system are you using and what terminal/editor are you using? Did you copy & paste the code in directly?