Since the context/memory of a chat or question for LLMs, or GPT more precisely, is limited to a certain token length, I'm struggling with how to provide my own data that the model was not trained on. A very common approach suggests embeddings are the way to go.
Here is where I struggle: even if we find the best-matching documents locally in a vector database, the context might still be too small if we want to provide multiple matches.
My question is: how could I send the relevant embedding vectors themselves rather than the texts that matched those vectors? These vectors are highly condensed and would save a lot of tokens. GPT should be able to understand a vector anyway, since it was created by their embeddings API, right?
Or is it just not possible to convert the vector back to text on their end?
Thanks in advance for any help and explanations to understand this better.
I’m only a student (13 yrs old), so I may be wrong.
As far as I know, GPT is a decoder-only transformer that uses token embeddings at its core, and the transformer architecture is designed to encode semantic information about text. However, you might also know that GPT is just a next-word predictor and that, in order to interpret embeddings directly, it would need to “peek inside its own code.” Essentially, the code uses the embeddings internally, but the chat interface itself, when predicting the next word, doesn’t have the ability to call the embeddings model.
Just a thought, though - if you created a plugin that allowed ChatGPT to access the OpenAI embeddings model, let it call the embeddings model and “teach itself” what the embeddings meant, and then inputted your query outputs as embeddings, ChatGPT might build up sufficient understanding to work out what your embeddings meant. The main disadvantage is that it becomes harder and harder for ChatGPT to understand the embeddings model as the dimensionality increases, so you’d be stuck with the less powerful models. However, if you can get access to plugins and you really think the embeddings context will benefit your task, feel free to try it out!
I understand your perspective, but the compressed vectors generated by OpenAI’s embedding models are only meaningful to the software that works with them directly. As for the chat on the ChatGPT website, the model cannot interpret raw vectors pasted into the conversation.
You can send multiple embedding matches by concatenating all the top hits and stuffing them into the prompt, being mindful of the max_tokens available for the model you are using. You are not limited to the content behind the single top embedding vector. So you take the “top N” embeddings, not just the single “top 1” embedding.
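A minimal sketch of this “top N” stuffing, assuming the hits are already ranked best-first. The `build_prompt` helper and the whitespace-based token estimate are illustrative only; in practice you would count tokens with a real tokenizer such as tiktoken.

```python
# Sketch: concatenate top-ranked chunks into the prompt until the
# token budget is spent. Token cost is approximated by word count.

def build_prompt(question, ranked_chunks, max_context_tokens=3000):
    """Stuff as many top hits as fit into the context budget."""
    selected = []
    used = 0
    for chunk in ranked_chunks:  # assumed sorted by similarity, best first
        cost = len(chunk.split())  # rough stand-in for a tokenizer
        if used + cost > max_context_tokens:
            break  # next chunk would blow the budget; stop here
        selected.append(chunk)
        used += cost
    context = "\n\n".join(selected)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt(
    "What is the refund policy?",
    ["Refunds are issued within 30 days.", "Shipping takes 5 days."],
)
```

The key point is the break condition: you keep adding matches until the next one would exceed the input budget, rather than stopping after the first hit.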
This is an area that I get pretty excited about. There are many ways to utilize bundles of relevant similarities to craft a well-performing application. Top three, or top five, and other metrics like the deviation of any of the top hits within the cluster all serve as jumping-off points to do some clever stuff.
On the PaLM 2 side, I love the added clarity when you can get three “candidate” results without added cost or latency. Google must have some pretty powerful parallel processing going on. Keyword extraction processes can merge candidates to get a much more complete picture, for example.
One approach might be to generate vector embeddings of more granular sections of past responses or documents. Maybe sending, say, a dozen full responses/documents back to the model would exceed its context window. But the dozen most relevant paragraphs (or even sentences) might be just fine. In other words, take advantage of your vector db’s ability to efficiently index and retrieve a large number of embeddings representing relatively small fragments of documents, and send the most relevant relatively small pieces along to the model.
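The paragraph-level retrieval described above can be sketched as follows. The `cosine` and `top_k_paragraphs` helpers are hypothetical names, and the tiny hand-written vectors stand in for real embeddings from an embeddings API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k_paragraphs(query_vec, para_vecs, paragraphs, k=3):
    """Return the k paragraphs whose embeddings are closest to the query."""
    scored = sorted(
        zip(paragraphs, para_vecs),
        key=lambda pv: cosine(query_vec, pv[1]),
        reverse=True,
    )
    return [p for p, _ in scored[:k]]
```

A real vector database does the same ranking with an approximate-nearest-neighbor index, but the principle is identical: embed small fragments, rank them against the query vector, and send only the closest few to the model.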
There are a few ways here. First, given the target LLM you plan on using, determine how many tokens max you want out, and then how many different top hits you want to present to the prompt, and this will determine your chunk size for embedding.
For example, suppose you are using one of the 4k models … so DaVinci, GPT-3.5-Turbo, etc. … and you allow for 1k of output. You now have 3k left for input. If you want the embeddings to retrieve the top 3, you chunk in 1k increments. If you want the top 6, you chunk in 500-token increments.
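The arithmetic above can be written out as a small helper. The function name `chunk_size_for` is made up for illustration; the numbers match the 4k-model example.

```python
def chunk_size_for(context_window, output_budget, top_n):
    """Max tokens per embedded chunk, given the window split above.

    input budget = context window minus reserved output tokens,
    divided evenly across the top-N retrieved chunks.
    """
    input_budget = context_window - output_budget
    return input_budget // top_n

# 4k window, 1k reserved for output, top 3 hits -> 1k-token chunks
print(chunk_size_for(4000, 1000, 3))
# same budget, top 6 hits -> 500-token chunks
print(chunk_size_for(4000, 1000, 6))
```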
For the most part, 500 tokens will contain at least one entire thought. When you go lower and lower in chunk size, you risk having “fragmented thoughts” and non-coherent output from the LLM. So you need to balance this as well.
For example, imagine the extreme case of embedding each individual word. If you then pull in all the “top N” words, you get a jumbled mess in the prompt and bizarre output.
So it’s a balance of LLM utilization (input/output), thought cohesiveness, and so on.