Hello.
I have developed a chatbot system with the text-davinci-002 model and the Pinecone vector database.
However, I need to convert vectors from the Pinecone vector database back to text.
In other words, the reverse of embedding.
I have used text-embedding-ada-002 for embedding.
Is it possible?
Hi,
There is no 1:1 mapping of input text to embedding vector. In theory, if your text contained fewer bytes of information than the vector stores, the information might be in there somewhere, but as the current state of the art stands, nobody knows how to reverse a vector to the text that created it.
Typically you would store the text which created the embedding with the embedding vector, otherwise I am not quite sure what the point would be of having a bunch of embedding vectors without their associated text.
You don't need to reverse the vector to the text that created it, just to some text that will re-generate the vector.
I imagine that a model can be trained to do this, but that model hasn't yet been built. It also needs to be trained on the specific embedding space of the model used to generate the embeddings in the first place.
You can take the training data used to build an embedding encoder, run it through to get the embedding vectors, and then transpose the dataset to train a decoder that goes from embedding vector to text. Current transformers use token embeddings rather than sentence/paragraph embeddings, though, so that'd be an interesting experiment.
Umm… unless you've got a spare quantum computer or two lying about, I don't think this is remotely feasible.
If you look at the Pinecone examples (Google Colab) and/or the langchain source code for the Pinecone.from_texts method (or maybe it's a function, I don't actually know), the way to achieve this is to add a metadatas field with the key 'text', to which your code passes the original pre-embedding text chunk when you upsert your vectors.
Then, on vector retrieval, you specify include_metadata=True to get back the original text with your retrieved vectors as part of the JSON payload, at which point you can tie the results together.
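For anyone who wants a concrete starting point, here's a rough sketch of that pattern. The index name, metadata key, and credentials are placeholders, and the exact client calls depend on which versions of the pinecone and openai Python SDKs you're on:

```python
# Rough sketch: keep the source text next to its vector as metadata, then ask for it
# back at query time. Index name, key, and credentials are placeholders; exact client
# setup varies by pinecone/openai SDK version.
from openai import OpenAI
from pinecone import Pinecone

oai = OpenAI()
pc = Pinecone(api_key="PINECONE_API_KEY")
index = pc.Index("my-chatbot-index")   # hypothetical index

def embed(text: str) -> list[float]:
    resp = oai.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

chunk = "Some pre-embedding text chunk."
index.upsert(vectors=[{
    "id": "chunk-001",
    "values": embed(chunk),
    "metadata": {"text": chunk},       # the original text rides along with the vector
}])

# include_metadata=True returns the stored text with each match.
res = index.query(vector=embed("user question"), top_k=3, include_metadata=True)
for match in res.matches:              # older clients return dict-style matches instead
    print(match.score, match.metadata["text"])
```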
If you truly need to reverse-embed to extract the text back from the embeddings, I don't suspect that's possible, because the embedding vector captures semantic relationships between the words in your original text; it isn't a word-for-word encoding with positional references at all. For example, ada-002 can encode up to 8191 tokens of text into a 1536-dimensional vector of (I think) 16-bit floats. So it isn't actually storing words, just numerical concept values… of some kind.
If you were able to map the latent space and explore it, perhaps you could then build up key parts of the original sentiment, but the task seems insurmountable with what we have to work with currently. Maybe an AI could learn to do it.
I'm going to lean in hard on the quantum computer angle.
If you'll pardon my rather hasty back-of-the-envelope math, there are on the order of 1e24500 possible embedding vectors.
Some weeks back I very briefly had a somewhat similar thought about embeddings as compression.
Say you had 8k tokens' worth of context: if you could identify 1000 seemingly random tokens which had essentially the same embedding vector, would you be able to send them as context and continue the chat without significant loss of fidelity? I also wonder if this might be partially what is happening in some of those 'random' token exploits.
Thanks for your kind reply.
From your replies, I gather it is impossible to reverse-embed from vector to text.
I posted this topic because I have obtained reduced-dimension vectors from the 1536-dimensional ones.
I need to generate text from these and feed it into the context for Q&A.
And I think reverse embedding is possible.
That's because the embedding model generates the same vector from the same text.
I'm not familiar with the principles of the OpenAI embedding model, but if we knew how it works, it should be possible.
If those are embeddings you just created, implying that you have the text they were generated on readily at hand, then why not add the text as metadata as suggested before?
Or, alternatively add another identifier to the embedding and retrieve the original text from another database?
I am asking because the problem appears somewhat off. Or are you trying to recover text from embeddings whose original text you don't have access to?
If you're OK with generating text that gets close to the vector, it's totally possible. Worst case, you build and train a totally separate LLM that infers from a 1536-dimensional vector back to text. The good news is that the training data for this would be easy to generate, as it's just the transpose of the embedder's training data.
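If anyone wants to try it, the data-prep half is straightforward; something along these lines (the corpus file, batch size, and output format are all just stand-ins):

```python
# Rough sketch of building the "transposed" dataset: (embedding vector -> source text)
# pairs you could train a decoder on. Corpus path and batch size are hypothetical;
# the embeddings call assumes the current openai Python SDK.
import json
from openai import OpenAI

client = OpenAI()

def embed_batch(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in resp.data]

with open("corpus.txt") as f:                      # hypothetical training corpus
    texts = [line.strip() for line in f if line.strip()]

with open("vec2text_pairs.jsonl", "w") as out:
    for i in range(0, len(texts), 100):            # batch to stay under request limits
        batch = texts[i:i + 100]
        for text, vec in zip(batch, embed_batch(batch)):
            # Inputs become targets: the decoder learns vector -> text.
            out.write(json.dumps({"input_embedding": vec, "target_text": text}) + "\n")
```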
If I understand correctly, people on the image side already do this: they go from semantic text-based embeddings to image latents to pixels.
I have an idea for a friendly challenge, if you're up for a game.
We set some ground rules first:
- A maximum number of tokens to encode (I'm thinking 200-500?), and
- A distance measure and minimum distance (I'm thinking cosine similarity and, just to make it easy, a score above 0.8?)
Then play proceeds as follows:
- I'll pick a piece of text, encode it using text-embedding-ada-002, and post a pastebin link with the embedding values.
- You will create your embedding vectors, find one with a score greater than the threshold, then post your near-text and a pastebin link to your embedding values.
- I will post my original text.
- We can then, for funsies, determine how closely your close-text comes to the essence of the original.
By way of example,
The Road Not Taken by Robert Frost (175 tokens)
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,
And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.
I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and Iā
I took the one less traveled by,
And that has made all the difference.
GPT-4 re-write (263 tokens)
Prompt: Please re-write the Robert Frost poem The Road Not Taken as a piece of straight prose.
In the midst of a yellow-hued forest, I found myself at a divergence of two paths. I was filled with regret, as I, being but a single person, could not embark on both routes. I remained there for a while, studying one path as far as my eyes could reach until it disappeared into the dense underbrush.
Eventually, I chose the other road. It appeared just as enticing, possibly even more so because it had the allure of being less frequented, as evidenced by its lush grassiness. However, upon closer inspection, the signs of passage on both paths seemed to be rather similar.
That particular morning, both paths lay undisturbed, blanketed in leaves untouched by any traveler. I made a mental note to save the first path for a different day, but a nagging doubt clouded my optimism. I knew from experience how one path often leads to another, leaving me uncertain if Iād ever return to this initial fork.
Many years from now, I imagine Iāll recount this tale, likely with a wistful sigh. I will tell of the time I stood before two diverging roads in a woodland setting. I made the choice to journey down the path less traveled, and that decision, in its own way, has profoundly affected the course of my life.
The two pieces of text above have a cosine similarity score of 0.9318.
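If you want to check scores like that yourself, here is a quick sketch (the texts below are placeholders for the full poem and the rewrite above, and it assumes the current openai Python client plus numpy):

```python
# Sketch of how a similarity score like the 0.9318 above can be computed:
# embed both texts with text-embedding-ada-002 and take the cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

poem = "Two roads diverged in a yellow wood, ..."      # full Frost text goes here
prose = "In the midst of a yellow-hued forest, ..."    # full GPT-4 rewrite goes here

a, b = embed(poem), embed(prose)
# ada-002 vectors are (approximately) unit length, so the dot product alone is
# already close to the cosine similarity; dividing by the norms keeps it exact.
score = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {score:.4f}")
```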
What do you say…
The gauntlet has been thrown down. Could get super interesting.
Ooh, kind of like 'vector hangman'. Better yet, prompt large language models to play, and see which models achieve the highest similarity within 20 guesses!
Just FYI for some perspective, there are on the order of 1e24500 possible embedding vectors but on the order of 1e40900 possible text strings which can be embedded using text-embedding-ada-002.
Meaning there are (on average) on the order of 1e16400 different text strings which point to the same embedding vector, for every possible embedding. So there are effectively infinitely many text strings which share the same embedding, but the probability of finding one at random is effectively zero.
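For anyone who wants to poke at those numbers, here's the rough arithmetic behind them. The assumptions are mine, not gospel: roughly 2^53 distinguishable values per dimension (a double's mantissa), a ~100k-token vocabulary, and an 8191-token maximum input.

```python
# Back-of-envelope reproduction of the counts above, under the stated assumptions.
import math

dims, values_per_dim = 1536, 2.0 ** 53   # assumed resolution per dimension
vocab, max_tokens = 100_000, 8191        # assumed vocabulary size and max input length

log10_vectors = dims * math.log10(values_per_dim)       # ~24500 -> "1e24500"
log10_texts = max_tokens * math.log10(vocab)            # ~41000 -> "1e40900"
log10_texts_per_vector = log10_texts - log10_vectors    # ~16400 -> "1e16400"

print(f"possible vectors  ~ 1e{log10_vectors:.0f}")
print(f"possible inputs   ~ 1e{log10_texts:.0f}")
print(f"inputs per vector ~ 1e{log10_texts_per_vector:.0f}")
```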
Now, it's well within our capabilities today to find a set of 1535 text inputs which, when we construct embeddings for them, resolve to vectors which form a basis for the embedding space.
We could even then express our target embedding as a linear combination of our basis vectors.
What would indeed be absolutely amazing is if we could then feed in the 1535 original input texts with the desired linear combination and ask an LLM to create a text string which is the proper combination of the basis texts.
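The linear-algebra half of that is easy to sketch today; it's the "ask an LLM to mix the texts" half that's hard. Something like this, assuming you've already saved the basis embeddings and the target vector (file names are hypothetical):

```python
# Sketch of the "linear combination of basis texts" idea: given embeddings of a set
# of known texts, solve for the coefficients that best reproduce a target embedding.
# (An exact representation needs the basis embeddings to span the full 1536-dim space.)
import numpy as np

basis_vectors = np.load("basis_embeddings.npy")   # hypothetical file, shape (K, 1536)
target = np.load("target_embedding.npy")          # hypothetical file, shape (1536,)

# Least squares: find coefficients c minimizing || basis_vectors.T @ c - target ||
coeffs, _, rank, _ = np.linalg.lstsq(basis_vectors.T, target, rcond=None)

print("rank of basis:", rank)
print("largest-weight basis texts:", np.argsort(-np.abs(coeffs))[:10])
print("reconstruction error:", np.linalg.norm(basis_vectors.T @ coeffs - target))
```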
If you kept all of your input texts to under 15 or 20 tokens, you could probably fit the question into gpt-4-32k, but given how bad the model is at math I think it would have quite a bit of difficulty properly mixing the prompts.
But, I'll be happy to be proven wrong!
This would be fun to play with!
Sadly, I seem to have misplaced all of my spare time. I thought it was just here?
Maybe you could generate a database of vectors from a dictionary. Then, word by word, you could find the nearest match by distance.
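That would recover related vocabulary rather than the original wording or word order, but as a sketch (assuming you've already pre-embedded a word list; file names are hypothetical):

```python
# Sketch of the dictionary idea: pre-embed a word list, then for a target vector
# return the nearest dictionary entries by cosine similarity.
import numpy as np

words = open("wordlist.txt").read().split()        # hypothetical dictionary
word_vecs = np.load("word_embeddings.npy")         # shape (len(words), 1536), pre-computed

# Normalize once so a dot product equals cosine similarity.
word_vecs = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)

def nearest_words(target: np.ndarray, k: int = 10) -> list[tuple[str, float]]:
    target = target / np.linalg.norm(target)
    sims = word_vecs @ target
    top = np.argsort(-sims)[:k]
    return [(words[i], float(sims[i])) for i in top]
```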
This is a super interesting problem. A paper was recently posted online that asks exactly the same question: Text Embeddings Reveal (Almost) As Much As Text.
Seems like the authors trained a model that does this pretty well, and their iterative approach Vec2Text can get the exact text back some of the time. Apparently building a system that can make multiple text guesses and get 'closer' to the true embedding works a lot better than naively training a LM to map embeddings back to text.
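Very loosely, the iterative part looks like a guess-and-correct loop. This is only a conceptual sketch, not the paper's actual code: embed() would be the ada-002 call discussed above, and propose_corrections() is a placeholder for the trained corrector model that does the real work.

```python
# Conceptual sketch of the iterative idea: make a text guess, re-embed it, and keep
# the guess whose embedding lands closest to the target. `propose_corrections` is a
# stand-in for the learned inversion/corrector model, not a real library call.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def invert(target: np.ndarray, embed, propose_corrections, steps: int = 20) -> str:
    guess = ""                                      # start from an empty hypothesis
    best_guess, best_score = guess, -1.0
    for _ in range(steps):
        # The learned model would see (target vector, current guess) and propose
        # revised texts; here it is just a placeholder callable.
        candidates = propose_corrections(target, guess)
        for cand in candidates:
            score = cosine(embed(cand), target)
            if score > best_score:
                best_guess, best_score = cand, score
        guess = best_guess
    return best_guess
```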
It is, if you had a VAE, or Variational Autoencoder. This VAE has an encoder and a decoder.
So the encoder:
Input Data → Vector
And the decoder:
Vector → Output Data
The input data could be whatever the model was trained on, so text, images, etc.
This type of thing can be used to de-noise images … so it's a real thing.
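For anyone who hasn't met the idea before, here's a toy illustration of the encoder/decoder split in PyTorch. It's a plain autoencoder rather than a full VAE (no sampling or KL term), and the dimensions are arbitrary:

```python
# Toy illustration of the encoder/decoder split described above (a plain autoencoder
# rather than a full VAE, just to show the two halves). Dimensions are arbitrary.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # Encoder: Input Data -> Vector
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Decoder: Vector -> Output Data
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Training minimizes reconstruction error, so the decoder learns to map the latent
# vector back to something close to the original input. The thread's point is that
# only the holder of the embedding model could train or expose that decoder half.
model = TinyAutoencoder()
x = torch.rand(8, 784)                      # stand-in batch of inputs
loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective
```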
But here, even though it is theoretically possible, it isn't possible for us, since OpenAI hasn't exposed any sort of decoder for the embedding model.
But the application of this seems pretty weak. Because you can already 'de-noise' your text and reconstruct it through correlation and embeddings. Having an unknown decoder from OpenAI might put out weird stuff that you don't want anyway [remember, it's 'trained on vast amounts of data, including random internet data'].
If you have a strong use case for an OpenAI decoder (for the embedding model), let me know!
Also, your order-of-magnitude arguments don't perfectly apply, because most sequences of tokens aren't textlike at all. The space of natural language strings is much smaller, and might be small enough to map 1:1 to embedding vectors.