How can I send vectors as a chat context?

philipHD · May 12, 2023, 8:17am

Since the context/memory of a chat or question for LLMs (more precisely, GPT) is limited to a fixed token length, I am struggling with how to provide my own data that the model was not trained on. Embeddings seem to be the most common approach.

OpenAI published a cookbook notebook (Question_answering_using_embeddings.ipynb in the openai/openai-cookbook repository on GitHub) that shows how to create an embedding of a user query, match it against a local vector database, and provide the closest results as text in the context/memory.

This is where I struggle: even if we find the best-matching documents locally in a vector database, the context might still be too small if we want to provide multiple matches.

My question is: how could I send all the relevant embedding vectors rather than the texts that were matched to those vectors? These vectors are highly condensed and would save a lot of tokens. GPT should be able to understand the vectors anyway, since they were created by OpenAI's own embeddings API, right?

Or is it just not possible to convert the vector back to text at their end?

Thanks in advance for any help and explanations to understand this better.

842starlite · May 15, 2023, 1:58pm

I’m only a student (13 yrs old), so I may be wrong.

As far as I know, you’re right that the GPT transformer is an encoder-decoder model that uses word embeddings at its core, and that the transformer architecture is designed to encode semantic information about text. However, you might also know that GPT is just a next-word predictor and that, in order to generate embeddings, it would need to “peek inside its own code.” Essentially, the code uses the embeddings, but the chat interface itself, when predicting the next word, doesn’t have the ability to call the embeddings model.

Just a thought, though: if you created a plugin that allowed ChatGPT to access the OpenAI embeddings model, let it call the embeddings model and “teach itself” what the embeddings meant, and then supplied your query outputs as embeddings, ChatGPT might build up enough understanding to work out what your embeddings meant. The main disadvantage is that it will become harder and harder for ChatGPT to understand the embeddings model as the dimensionality increases, so you’ll be stuck with the less powerful models. However, if you can get access to plugins and you really think the embeddings context will benefit your task, feel free to try it out!

zsxindu · May 15, 2023, 2:40pm

I may understand your perspective, but the compressed vectors generated by OpenAI’s fine-tuned models are only understandable by the local model. As for the chat content on the ChatGPT website, it is not recognized by the model.

curt.kennedy · May 15, 2023, 2:51pm

You can send multiple embedding matches by concatenating the top hits and stuffing them into the prompt, being mindful of the max_tokens available for the model you are using. You are not limited to the content behind the single top embedding vector: take the “top N” embeddings, not just the “top 1” embedding.
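A minimal sketch of the “top N” idea, assuming the vectors are already stored next to their source text. The two-dimensional vectors, the `build_context` helper, and the 4-characters-per-token estimate are illustrative assumptions, not part of any OpenAI API; a real setup would use the embeddings API and a proper tokenizer.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rough_token_count(text):
    # Assumption: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def build_context(query_vec, store, top_n, token_budget):
    """Concatenate the texts of the top_n closest vectors within a token budget."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[0]),
        reverse=True,
    )
    picked, used = [], 0
    for _vec, text in ranked[:top_n]:
        cost = rough_token_count(text)
        if used + cost > token_budget:
            break  # stop before overflowing the prompt
        picked.append(text)
        used += cost
    return "\n\n".join(picked)

# Toy store of (vector, text) pairs; the vectors are made up for illustration.
store = [
    ([1.0, 0.0], "Cats are small domesticated felines."),
    ([0.9, 0.1], "Kittens are young cats."),
    ([0.0, 1.0], "The stock market closed higher today."),
]
context = build_context([1.0, 0.05], store, top_n=2, token_budget=100)
print(context)
```

The text of the matches goes into the prompt, not the vectors themselves; the vectors are only used to decide which texts to include.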

bill.french · May 15, 2023, 4:37pm

I love your thinking, but as far as I know, vectors are very dense and big, probably bigger than the text they represent. Right?
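A rough way to check this is to compare character counts. The sketch below assumes a 1536-dimensional vector (the output size of text-embedding-ada-002), fills it with random values as a stand-in for a real embedding, and compares its text serialization with a passage of about 2,000 characters, which is roughly 500 tokens under the common 4-characters-per-token rule of thumb.

```python
import json
import random

EMBEDDING_DIM = 1536  # output size of text-embedding-ada-002

# Stand-in for a real embedding: random floats, rounded to 6 decimals.
random.seed(0)
vector = [round(random.uniform(-1.0, 1.0), 6) for _ in range(EMBEDDING_DIM)]

# A prompt is text, so the vector would have to be written out as text.
vector_as_text = json.dumps(vector)

# A passage of about 2,000 characters (roughly 500 tokens).
passage = "word " * 400

print("characters in serialized vector:", len(vector_as_text))
print("characters in text passage:     ", len(passage))
```

Written out as text, the vector is several times longer than the passage it would stand for, and long runs of digits tend to tokenize less efficiently than ordinary words, so the gap in tokens is likely larger still.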

bill.french · May 15, 2023, 4:44pm

This is an area that I get pretty excited about. There are many ways to utilize bundles of relevant similarities to craft a well-performing application. Top three, or top five, and other metrics like the deviation of any of the top hits within the cluster all serve as jumping-off points to do some clever stuff.

On the PaLM 2 side, I love the added clarity when you can get three “candidate” results without added cost or latency. Google must have some pretty powerful parallel processing going on. Keyword extraction processes can merge candidates to get a much more complete picture, for example.

curt.kennedy · May 15, 2023, 4:58pm

We normally run multiple results in parallel, not just embeddings, but prompts in general, and let the operator decide the best answer.

So we find it’s best to just fan-out the data to the model, and let the completions roll-in!
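The fan-out pattern could be sketched like this. `fake_model` is a hypothetical placeholder for whatever completion call is actually used; only the standard-library thread pool is real here.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_model(prompt):
    # Hypothetical placeholder for a real completion call.
    return f"answer to: {prompt}"

def fan_out(prompts, call_model, max_workers=4):
    """Send every prompt to the model in parallel; results keep prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_model, prompts))

prompts = [
    "Answer using context A: ...",
    "Answer using context B: ...",
    "Answer using context C: ...",
]
candidates = fan_out(prompts, fake_model)
for candidate in candidates:
    print(candidate)  # an operator would pick the best of these
```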

dliden · May 15, 2023, 6:06pm

One approach might be to generate vector embeddings of more granular sections of past responses or documents. Maybe sending, say, a dozen full responses/documents back to the model would exceed its context window. But the dozen most relevant paragraphs (or even sentences) might be just fine. In other words, take advantage of your vector db’s ability to efficiently index and retrieve a large number of embeddings representing relatively small fragments of documents, and send the most relevant relatively small pieces along to the model.
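As a sketch of that idea, documents can be split on blank lines and each paragraph indexed separately. `toy_embed` is a made-up placeholder, not a real embedding; in practice `embed` would call an embeddings API, and the list would live in a vector database.

```python
def split_into_paragraphs(document):
    """Split on blank lines and drop empty fragments."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def index_fragments(documents, embed):
    """Build one index entry per paragraph, remembering where it came from."""
    index = []
    for doc_id, document in documents.items():
        for position, paragraph in enumerate(split_into_paragraphs(document)):
            index.append({
                "doc_id": doc_id,
                "position": position,
                "text": paragraph,
                "vector": embed(paragraph),
            })
    return index

def toy_embed(text):
    # Placeholder only: NOT a meaningful embedding.
    return [float(len(text)), float(text.count(" "))]

documents = {
    "a": "First paragraph.\n\nSecond paragraph.",
    "b": "Only one paragraph here.",
}
index = index_fragments(documents, toy_embed)
print(len(index), "fragments indexed")
```

Keeping `doc_id` and `position` with each fragment makes it possible to pull in neighbouring paragraphs later if a single fragment turns out to be too thin.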

curt.kennedy · May 15, 2023, 7:30pm

There are a few ways to approach this. First, given the target LLM you plan to use, decide the maximum number of output tokens you want and how many top hits you want to present in the prompt. Together these determine your chunk size for embedding.

For example, suppose you are using one of the 4k models (DaVinci, GPT-3.5-Turbo, etc.) and you allow for 1k of output. You now have 3k left for input. If you want the embeddings to retrieve the top 3, you chunk in 1k-token increments. If you want the top 6, you chunk in 500-token increments.
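The arithmetic can be written as a small helper. The function name and the `prompt_overhead` parameter (tokens reserved for the question and instructions) are illustrative additions, not something from the post.

```python
def chunk_size_for(context_window, max_output_tokens, top_n, prompt_overhead=0):
    """Largest chunk size (in tokens) so that top_n chunks fit in the input budget."""
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    input_budget = context_window - max_output_tokens - prompt_overhead
    if input_budget <= 0:
        raise ValueError("no tokens left for input")
    return input_budget // top_n

# The example from the post: a 4k model with 1k reserved for output.
print(chunk_size_for(4000, 1000, top_n=3))  # 1000
print(chunk_size_for(4000, 1000, top_n=6))  # 500

# Reserving 200 tokens for the question and instructions shrinks the chunks.
print(chunk_size_for(4000, 1000, top_n=6, prompt_overhead=200))  # 466
```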

For the most part, 500 tokens will contain at least one entire thought. When you go lower and lower in chunk size, you risk having “fragmented thoughts” and non-coherent output from the LLM. So you need to balance this as well.

For example, imagine the extreme case of embedding each word. If you then pull in all the “top N” words, you will get a jumbled mess in the prompt and bizarre output.

So it’s a balance between LLM utilization (input/output), thought cohesiveness, etc.
