Embeddings: Converting an embedded vector back to natural language?

I find the OpenAI documentation on this is missing a huge amount of information.

I’ve had to piece things together from other sources, but I still cannot work out how to convert an embedded vector back to natural language.

I am not necessarily asking for code, although if someone has an example (in PHP), that would be amazing.

I’ve been able to create a PHP script that compares:

  1. a user input query (as a vector created by sending a request to OpenAI’s embeddings API) and
  2. the search text (also as a vector created by sending a request to OpenAI’s embeddings API),

and then finds the most similar vectors.
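
For anyone else reading, the comparison step is just a cosine over the two vectors. A minimal PHP sketch (the function and variable names here are my own, not from any library):

```php
<?php
// Cosine similarity between two embedding vectors (plain PHP float arrays).
// Returns a value in [-1, 1]; higher means the texts are closer in meaning.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $v) {
        $dot   += $v * $b[$i];
        $normA += $v * $v;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}
```

OpenAI’s embeddings are normalised to unit length, so ranking by the plain dot product gives the same order.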

But what’s the next step to convert the “found” most-similar search text vector back to natural language to deliver an answer to the user’s query?

Embeddings cannot be converted back to language. You have to store the original content (or a link to it) alongside the embedding.
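
For example, something like this (a minimal sketch — the array layout and field names are just illustrative):

```php
<?php
// Keep the original text next to its embedding; the text - not the vector -
// is what you return to the user once the nearest embedding has been found.
$store = [
    [
        'text'      => 'The original natural language chunk...',
        'embedding' => [0.0123, -0.0456 /* ...1536 floats from the embeddings API */],
    ],
];

// After ranking every row by similarity, deliver the stored text:
$bestIndex = 0; // index of the highest-scoring row
echo $store[$bestIndex]['text'];
```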

Interesting! Thanks for explaining. Was this in the docs and I missed it?

Not sure. The doc authors might have considered it implicit. You interpreted embeddings as a compressed version of the original text, but they’re actually just representations of its meaning.

Herein lies the problem with most API documentation 🙂

It’s almost like they want to make it hard for people to use their API and give them money!

I assume this is how embeddings work, and that OpenAI is only needed to embed text and to process the match between the user prompt and similar chunks, i.e. steps 2, 4 & 8 (see the sketch after this list):

  1. Take blog posts and cut them up into smaller parts of roughly 512 tokens at most - these are called “chunks” of natural language.
  2. Embed each natural language chunk via OpenAI’s embeddings API - this creates an “embedding vector”, a list of numbers that represents the natural language chunk.
  3. Insert each embedding vector into a vector database, with the related natural language chunk indexed against it.
  4. Create a front-end app that takes a user prompt and embeds it, again via OpenAI’s embeddings API.
  5. Get the front-end app to take the embedding vector for the user prompt and search it against all embedding vectors in the vector database.
  6. Use a cosine similarity function to measure how similar the user prompt is to each chunk in the database - this is done by comparing embedding vectors, not natural language chunks.
  7. When the best match between the user prompt’s vector and the chunk vectors is found, get the front-end app to look up the natural language chunk indexed against it.
  8. Get the front-end app to send that natural language chunk to OpenAI via the GPT-3.5 chat model - the request includes a system message, the original user prompt and the relevant blog post chunk.
  9. Get the front-end app to take the output of the GPT-3.5 chat model and deliver it to the user.
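
In code, I picture those steps looking roughly like this (a sketch only - the in-memory “database”, the chunk contents and the helper names are illustrative, not from any library):

```php
<?php
// Sketch of steps 2-9: embed chunks, store them with their text, embed the
// user prompt, rank by cosine similarity, then answer via the chat model.

const OPENAI_KEY = 'sk-...'; // your API key

// Generic JSON POST to the OpenAI REST API.
function openaiPost(string $url, array $payload): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            'Authorization: Bearer ' . OPENAI_KEY,
        ],
        CURLOPT_POSTFIELDS     => json_encode($payload),
    ]);
    $response = json_decode(curl_exec($ch), true);
    curl_close($ch);
    return $response;
}

// Steps 2 & 4: one embedding per piece of text.
function embed(string $text): array
{
    $response = openaiPost('https://api.openai.com/v1/embeddings', [
        'model' => 'text-embedding-ada-002',
        'input' => $text,
    ]);
    return $response['data'][0]['embedding'];
}

// (Same cosine function as the earlier sketch in this thread.)
function cosineSimilarity(array $a, array $b): float
{
    $dot = $na = $nb = 0.0;
    foreach ($a as $i => $v) {
        $dot += $v * $b[$i];
        $na  += $v * $v;
        $nb  += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($na) * sqrt($nb));
}

// Step 3: "index" each chunk alongside its vector (a real app would use a DB).
$chunks = ['First blog post chunk...', 'Second blog post chunk...'];
$db = [];
foreach ($chunks as $chunk) {
    $db[] = ['text' => $chunk, 'embedding' => embed($chunk)];
}

// Steps 5-7: embed the prompt, score every stored vector, keep the best text.
$prompt    = 'What does the blog say about X?';
$promptVec = embed($prompt);
$best      = null;
$bestScore = -INF;
foreach ($db as $row) {
    $score = cosineSimilarity($promptVec, $row['embedding']);
    if ($score > $bestScore) {
        $bestScore = $score;
        $best      = $row;
    }
}

// Step 8: send the retrieved chunk plus the original prompt to the chat model.
$answer = openaiPost('https://api.openai.com/v1/chat/completions', [
    'model'    => 'gpt-3.5-turbo',
    'messages' => [
        ['role' => 'system', 'content' => 'Answer using only the provided context.'],
        ['role' => 'user',   'content' => "Context:\n{$best['text']}\n\nQuestion: {$prompt}"],
    ],
]);

// Step 9: deliver the model's natural language answer to the user.
echo $answer['choices'][0]['message']['content'];
```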

Since I found this page through a web search, I think it’s useful to point out that while it may not be possible for devs to invert embeddings via the OpenAI API, it very much is possible in general. See “Text Embeddings Reveal (Almost) As Much As Text” from Cornell, 2023.
