Lossy vector to text as compression

Hi,
I understand that the text-to-vector operation is one-way and not reversible. However, is it possible to convert an embedding into a shorter (possibly gibberish) text that could be used in a system prompt to infuse information or context without taking up too many tokens?

In other words, given a text-to-vector operation:
text_to_vec(“The quick brown fox jumps over the lazy dog”) → [31, 19, …, 62]

Is it possible to do a reverse vector-to-text operation that would yield an arbitrary string:
vec_to_text([31, 19, …, 62]) → some gibberish string like “fjeGUNjef5nJFN”

Where the text-to-vector operation on that string would yield the same (or a very similar) vector to the original one:
text_to_vec(“fjeGUNjef5nJFN”) → [31, 19, …, 62]

The purpose would be to replace a longer string with a shorter one (containing fewer tokens).
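For concreteness, here is a rough sketch of the check I have in mind, using the sentence-transformers library and the all-MiniLM-L6-v2 model purely as an example (any embedding model would do, and the candidate string is just a placeholder):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def text_to_vec(text: str) -> np.ndarray:
    # Embed a single string into a dense vector.
    return model.encode(text)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

original = "The quick brown fox jumps over the lazy dog"
candidate = "fjeGUNjef5nJFN"  # hypothetical shorter replacement string

# The question: can a much shorter candidate be found whose similarity is close to 1.0?
print(cosine_similarity(text_to_vec(original), text_to_vec(candidate)))
```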

Does this make any sense?

The best inverse you can do, i.e. mapping a vector back to text, is to take the vector, compare it against previously stored vectors, find the closest one, and return the text behind that closest vector.

This is a pseudo inverse, not a real inverse.
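A minimal sketch of that lookup, assuming the whole corpus of previously embedded texts fits in memory as plain numpy vectors, and reusing a text_to_vec embedding call like the one sketched in the question:

```python
import numpy as np

# Texts that were embedded earlier, paired with their vectors.
corpus_texts = [
    "The quick brown fox jumps over the lazy dog",
    "Pack my box with five dozen liquor jugs",
]
corpus_vecs = np.stack([text_to_vec(t) for t in corpus_texts])

def vec_to_text(query_vec: np.ndarray) -> str:
    # Cosine similarity of the query against every stored vector.
    sims = corpus_vecs @ query_vec / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    # Return the text behind the closest stored vector.
    return corpus_texts[int(np.argmax(sims))]
```

It can only ever return texts it has already seen, which is why it is a pseudo inverse rather than a real one.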

As for shortening text, why not just have an LLM attempt it? Or you could try extracting keywords (with or without an LLM) and use those as your shortened text.
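For the keyword route, even something this naive can serve as a starting point (the stopword list and token count are arbitrary):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "over"}

def extract_keywords(text: str, k: int = 5) -> str:
    # Keep the k most frequent non-stopword tokens as the shortened text.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return " ".join(word for word, _ in counts.most_common(k))

print(extract_keywords("The quick brown fox jumps over the lazy dog"))
# -> "quick brown fox jumps lazy"
```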


This spawned an idea…

Have a model generate a shortened summary, compare its embedding to the embedding of the original text, and repeat until you reach a chosen similarity threshold (rough sketch below).

Repeat for a few thousand different bits of text, then fine-tune a model for this type of text compression.
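Something like this as a rough sketch of that loop, assuming the openai Python SDK (v1+) with illustrative model names and an arbitrary similarity threshold; the (original text, compressed text) pairs it yields would then be the training data for the fine-tune:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compress(text: str, threshold: float = 0.9, max_rounds: int = 5) -> str:
    target = embed(text)
    best = text
    for _ in range(max_rounds):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Shorten this text as much as possible while keeping its meaning:\n{best}",
            }],
        )
        candidate = resp.choices[0].message.content
        # Keep the candidate only while it still embeds close to the original.
        if cosine(target, embed(candidate)) < threshold:
            break
        best = candidate
    return best
```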
