We’re rolling out Embeddings to all API users as part of a public beta.
Our Embeddings offering combines a new endpoint and set of models to address more advanced search, clustering, and classification tasks. The /embeddings endpoint returns a vector representation of the given input that can be easily consumed by machine learning models and algorithms.
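A quick sketch of what calling the endpoint and reading the result looks like, using the 2021-era openai-python client. The network call needs an API key, so it is shown commented out; the response below is a made-up, heavily truncated example of the documented shape (real embeddings have hundreds or thousands of dimensions):

```python
# Hypothetical sketch of a /embeddings call with the 2021-era
# openai-python client. The call itself is commented out because it
# requires an API key; the response dict below is invented for
# illustration and only mirrors the documented structure.
#
# import openai
# response = openai.Engine(id="babbage-similarity").embeddings(
#     input="Sample document text goes here",
# )

response = {
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "index": 0,
            # Real vectors have hundreds or thousands of dimensions;
            # these three values are placeholders.
            "embedding": [0.018, -0.024, 0.007],
        },
    ],
}

vector = response["data"][0]["embedding"]
print(len(vector))  # dimensionality of the (toy) embedding vector
```

The returned vector is what you would feed into downstream search, clustering, or classification code.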
We are releasing three sets of models:
Text Similarity: excels in capturing semantic similarity between pairs of text
Text Search: excels in finding relevant documents for a query among a collection of documents
Code Search: excels in finding relevant code blocks for a natural language query
Please read our in-depth guide for examples on how to apply embeddings to different use cases, and refer to the API reference for more details on how to query the endpoint.
The /embeddings endpoint is offered for free through the end of 2021. If you have any questions, issues, or general feedback, or if you would like to use embeddings for academic or research purposes, please contact embeddings@openai.com
Has the embeddings endpoint been pulled down? I started using it yesterday, and today there is no hint of the documentation or guide anywhere on the site.
Usage stats from yesterday are also gone. The code I wrote still works; is it likely to stop working soon?
This is excellent. I switched the search in a project I’m working on over to these embeddings and it’s so much better, even with just Ada. I was using GloVe before.
One feature request would be to have access to these embedding layers for fine-tuned models too. If that’s even possible.
This is awesome. What’s the meaning of the different indices?
i.e. response["data"][0] vs response["data"][1]
Maybe they indicate which token’s embedding is used? i.e. 0 being the last token, 1 the one before last…
(couldn’t find it in the guide)
Close! We only return one embedding per input, considered as a whole. The length of the embedding won’t change based on the length of the input passed in.
There should be one entry in the “data” array for each input you’ve submitted in the request. To figure out which embedding maps to which input, look at the “index” field for that entry.
You can see the index field in the response object in the docs here: OpenAI API
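To make the index mapping concrete, here is a small sketch with a made-up batched response (all values are invented; only the “data”/“index”/“embedding” structure is taken from the reply above):

```python
# Made-up batched response: one "data" entry per submitted input, with
# an "index" field tying each embedding back to its position in the
# request. Entries are deliberately out of order here to show why you
# should match on "index" rather than on array position.
inputs = ["first document", "second document"]
response = {
    "data": [
        {"index": 1, "embedding": [0.2, 0.1]},
        {"index": 0, "embedding": [0.9, 0.4]},
    ],
}

# Sort entries by "index" so they line up with the inputs list.
ordered = sorted(response["data"], key=lambda entry: entry["index"])

# Or build an explicit input -> embedding mapping.
by_input = {inputs[e["index"]]: e["embedding"] for e in response["data"]}
print(by_input["first document"])  # [0.9, 0.4]
```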
Oh I see - Is sending tokens already supported?
Sending an array of token arrays as stated here OpenAI API
yields:
response = openai.Engine(id="babbage-similarity").embeddings(
input=[['Sample', 'Ġdocument', 'Ġtext', 'Ġgoes', 'Ġhere']],
)
embeddings = response['data'][0]['embedding']
InvalidRequestError: [['Sample', 'Ġdocument', 'Ġtext', 'Ġgoes', 'Ġhere']] is not valid under any of the given schemas - 'input'
We support arrays of token arrays; tokens are ints. In this case you’ve submitted an array of arrays of strings. For strings, we support a single string or an array of strings. The following might be what you want:
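A sketch of the accepted input shapes, per the reply above. The token ids below are placeholders, not real BPE ids; in practice you would get them from a GPT-2/GPT-3 tokenizer. The API call itself is shown commented out since it needs a key:

```python
# Accepted input shapes, per the reply above. The token ids are
# placeholder numbers, not real BPE ids; a real tokenizer would
# produce them.
single_string = "Sample document text goes here"       # one string
string_array = ["Sample document text goes here"]      # array of strings
token_arrays = [[36674, 3188, 2420, 2925, 994]]        # array of arrays of ints

# The failing request put strings inside the inner arrays; the
# endpoint expects ints there.
assert all(isinstance(t, int) for arr in token_arrays for t in arr)

# response = openai.Engine(id="babbage-similarity").embeddings(
#     input=token_arrays,
# )
```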
What does “embedding” mean in the OpenAI context? Why this choice of name? What is the difference between “embedding” and “non-embedding”, and what would the opposite of an “embedding” be? I don’t understand what I should picture under this term. Is it a property/method of the model, or something like a reward/food for the model?
Embedding is a term of trade in the AI/ML field.
In simple terms, an embedding is a multi-dimensional numeric representation of a concept that distills its semantics (properties).
For example, the redness, roundness, and sweetness of an apple are three properties that can be expressed numerically and together constitute a 3-dimensional embedding of “apple”. Think of it as a point (vector) in 3-D space. The sphere around it contains the things that are close to apple.
Just expand this to many more dimensions, and consider that the individual dimensions are not human-interpretable (unlike sweetness, redness, etc. above), and you have your 1024/2048/4096-dimensional embeddings.
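The apple example above can be sketched in a few lines; the vectors and their values are invented purely for illustration:

```python
import math

# Toy 3-dimensional "embeddings" with axes (redness, roundness,
# sweetness); all values are invented for illustration.
apple  = [0.9, 0.8, 0.7]
cherry = [0.8, 0.9, 0.8]   # similar properties, so near apple
lemon  = [0.1, 0.6, 0.1]   # different properties, so farther away

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine_similarity(apple, cherry))  # close to 1.0
print(cosine_similarity(apple, lemon))   # noticeably lower
```

Real embedding vectors are compared the same way: cosine similarity between two embeddings approximates how semantically close the two inputs are.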
The idea of “concept” being a synonym of “embedding” is genius, but only in this context. In general, it may not be so. As an example off the top of my head, in a collaborative filtering context for recommendations, the users’ embeddings may be synonymous with preferences. In a simple tf-idf model, the discovered embeddings are (roughly) the normalized relative frequencies of terms.
However, given that the question was about the OpenAI GPT3 embeddings endpoint, I agree with your interpretation.
Hi,
when I tried this snippet:
def get_embedding(text, engine="davinci-similarity"):
    text = text.replace("\n", " ")
    return openai.Engine(id=engine).embeddings(input=[text])['data'][0]['embedding']

get_embedding('Its a car')
I got an error saying:
/usr/local/lib/python3.7/dist-packages/openai/api_requestor.py in _interpret_response_line(self, rbody, rcode, rheaders, stream)
317 if stream_error or not 200 <= rcode < 300:
318 raise self.handle_error_response(
→ 319 rbody, rcode, resp.data, rheaders, stream_error=stream_error
320 )
321 return resp
InvalidRequestError: Engine not found
Why is this happening? The same happens with babbage-similarity, curie-similarity, etc.
But I am getting a response from this snippet:
response = openai.Completion.create(engine="davinci", prompt="This is a test", max_tokens=5)
so it seems there is some issue specifically with getting the embeddings.
Thanks in advance