Embeddings: Converting an embedded vector back to natural language?

I find the OpenAI documentation on this is missing a huge amount of information.

I’ve had to piece things together from other sources, but I still cannot work out how to convert an embedded vector back to natural language.

I am not necessarily asking for code, although if someone has an example (in PHP), that would be amazing.

I’ve been able to create a PHP script that compares:

  1. a user input query (as a vector created by sending a request to OpenAI’s embeddings API) and
  2. the search text (also as a vector created by sending a request to OpenAI’s embeddings API),

and then finds the most similar vectors.
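
For anyone else reading, the comparison step is just a cosine over the two vectors. A minimal PHP sketch (the function and variable names here are my own, not from any library):

```php
<?php
// Cosine similarity between two embedding vectors (plain PHP float arrays).
// Returns a value in [-1, 1]; higher means the texts are closer in meaning.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $v) {
        $dot   += $v * $b[$i];
        $normA += $v * $v;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}
```

OpenAI’s embeddings are normalised to unit length, so ranking by the plain dot product gives the same order.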

But what’s the next step to convert the “found” most-similar search text vector back to natural language to deliver an answer to the user’s query?

Embeddings cannot be converted back to language. You have to store the original content (or a link to it) alongside the embedding.
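
For example, something like this (a minimal sketch — the array layout and field names are just illustrative):

```php
<?php
// Keep the original text next to its embedding; the text - not the vector -
// is what you return to the user once the nearest embedding has been found.
$store = [
    [
        'text'      => 'The original natural language chunk...',
        'embedding' => [0.0123, -0.0456 /* ...1536 floats from the embeddings API */],
    ],
];

// After ranking every row by similarity, deliver the stored text:
$bestIndex = 0; // index of the highest-scoring row
echo $store[$bestIndex]['text'];
```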

Interesting! Thanks for explaining. Was this in the docs and I missed it?

Not sure. The doc authors might have considered it implicit. You interpreted embeddings as a compressed version of the original text, but they’re actually just representations of its meaning.

Herein lies the problem with most API documentation 🙂

It’s almost like they want to make it hard for people to use their API and give them money!

I assume this is how embeddings work, and that OpenAI is only needed to embed text and to process the match between the user prompt and similar chunks, i.e. steps 2, 4 & 8 (see the sketch after this list):

  1. Take blog posts and cut them up into smaller parts of roughly 512 tokens at most - these are called “chunks” of natural language.
  2. Embed each natural language chunk via OpenAI’s embeddings API - this creates an “embedding vector”, a list of numbers that represents the natural language chunk.
  3. Insert each embedding vector into a vector database, with the related natural language chunk indexed against it.
  4. Create a front-end app that takes a user prompt and embeds it, again via OpenAI’s embeddings API.
  5. Get the front-end app to take the embedding vector for the user prompt and search it against all embedding vectors in the vector database.
  6. Use a cosine similarity function to measure how similar the user prompt is to each chunk in the database - this is done by comparing embedding vectors, not natural language chunks.
  7. When the best match between the user prompt’s vector and the chunk vectors is found, get the front-end app to look up the natural language chunk indexed against it.
  8. Get the front-end app to send that natural language chunk to OpenAI via the GPT-3.5 chat model - the request includes a system message, the original user prompt and the relevant blog post chunk.
  9. Get the front-end app to take the output of the GPT-3.5 chat model and deliver it to the user.
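
In code, I picture those steps looking roughly like this (a sketch only - the in-memory “database”, the chunk contents and the helper names are illustrative, not from any library):

```php
<?php
// Sketch of steps 2-9: embed chunks, store them with their text, embed the
// user prompt, rank by cosine similarity, then answer via the chat model.

const OPENAI_KEY = 'sk-...'; // your API key

// Generic JSON POST to the OpenAI REST API.
function openaiPost(string $url, array $payload): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            'Authorization: Bearer ' . OPENAI_KEY,
        ],
        CURLOPT_POSTFIELDS     => json_encode($payload),
    ]);
    $response = json_decode(curl_exec($ch), true);
    curl_close($ch);
    return $response;
}

// Steps 2 & 4: one embedding per piece of text.
function embed(string $text): array
{
    $response = openaiPost('https://api.openai.com/v1/embeddings', [
        'model' => 'text-embedding-ada-002',
        'input' => $text,
    ]);
    return $response['data'][0]['embedding'];
}

// (Same cosine function as the earlier sketch in this thread.)
function cosineSimilarity(array $a, array $b): float
{
    $dot = $na = $nb = 0.0;
    foreach ($a as $i => $v) {
        $dot += $v * $b[$i];
        $na  += $v * $v;
        $nb  += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($na) * sqrt($nb));
}

// Step 3: "index" each chunk alongside its vector (a real app would use a DB).
$chunks = ['First blog post chunk...', 'Second blog post chunk...'];
$db = [];
foreach ($chunks as $chunk) {
    $db[] = ['text' => $chunk, 'embedding' => embed($chunk)];
}

// Steps 5-7: embed the prompt, score every stored vector, keep the best text.
$prompt    = 'What does the blog say about X?';
$promptVec = embed($prompt);
$best      = null;
$bestScore = -INF;
foreach ($db as $row) {
    $score = cosineSimilarity($promptVec, $row['embedding']);
    if ($score > $bestScore) {
        $bestScore = $score;
        $best      = $row;
    }
}

// Step 8: send the retrieved chunk plus the original prompt to the chat model.
$answer = openaiPost('https://api.openai.com/v1/chat/completions', [
    'model'    => 'gpt-3.5-turbo',
    'messages' => [
        ['role' => 'system', 'content' => 'Answer using only the provided context.'],
        ['role' => 'user',   'content' => "Context:\n{$best['text']}\n\nQuestion: {$prompt}"],
    ],
]);

// Step 9: deliver the model's natural language answer to the user.
echo $answer['choices'][0]['message']['content'];
```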

Since I found this page through a web search, I think it’s useful to point out that while it may not be possible for devs to invert embeddings via the OpenAI API, it very much is possible in general. See “Text Embeddings Reveal (Almost) As Much As Text” from Cornell, 2023.
