I’m trying to figure out how to take past conversation data and either fine-tune my own embedding model on it, use an existing embedding model (like OpenAI’s ada), or train one from scratch, so I can embed the data into a vector database that can be queried via semantic search to retrieve pieces of context for a system prompt.
To my knowledge most embedding models are designed for queries like Q&A, but I’m unsure how I would take unstructured past conversation data and embed it efficiently. Would I embed the entire conversation? Just the questions and answers? Individual paragraphs? I’m really not sure what the best way to do this would be…
I’ve had a look around but can’t seem to find many fine-tuned embedding models for this specific use case. I’m also conflicted about what kind of semantic search would be best here: dot product or cosine similarity? Or are both possible?
What an embedding model is trying to do is directly tap the internal semantic encoding of an LLM.
The idea is that similar concepts probably have similar vectors. The industry has largely settled on cosine similarity (which is just a dot product over normalized vectors), but you can use a raw dot product, Euclidean distance, or theoretically even something like substitution-only Levenshtein distance if you really wanted to.
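Concretely, here’s the relationship in a few lines of numpy (a minimal sketch; the toy vectors are made-up stand-ins for real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings.
a = np.array([0.1, 0.3, 0.9])
b = np.array([0.2, 0.2, 0.8])

print(cosine_similarity(a, b))

# Normalize to unit length first and a plain dot product gives the same
# number, which is why the two metrics are interchangeable for normalized
# embeddings (OpenAI's ada vectors already come unit-normalized).
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
print(np.dot(a_hat, b_hat))
```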
The whole point is that the model can understand the semantic meaning behind whatever syntactic structure you throw at it. With multimodal embeddings, for example, it wouldn’t even matter whether you sent in a text description, an image, or both. If they describe or depict the same thing, their embedding vectors should be extremely close.
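For example, with an open-source CLIP checkpoint via sentence-transformers (a sketch; `cat.jpg` is a placeholder path, and the model name is just one multimodal option):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps images and text into the same vector space.
model = SentenceTransformer("clip-ViT-B-32")

img_emb = model.encode(Image.open("cat.jpg"))  # placeholder image path
txt_emb = model.encode("a photo of a cat")

# If the text describes the image, this cosine similarity should be high.
print(util.cos_sim(img_emb, txt_emb))
```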
So what you do with the embeddings is up to you. They’re incredibly powerful, particularly with unstructured text.
RAG (which covers most Q&A) is just one popular and easy-to-understand use of embeddings.
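To make that concrete, the basic retrieval loop looks roughly like this (a minimal sketch assuming the `openai` v1 Python client with `OPENAI_API_KEY` set; the chunks and query are invented examples, and a real vector database would replace the brute-force search):

```python
import numpy as np
from openai import OpenAI  # pip install openai; assumes OPENAI_API_KEY is set

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

# Ingestion: embed each stored chunk once (invented examples).
chunks = [
    "User mentioned they are building a Django app for inventory tracking.",
    "User prefers answers with code examples rather than prose.",
]
index = np.stack([embed(c) for c in chunks])

# Query time: embed the question and rank the chunks. ada vectors are unit
# length, so a plain dot product here *is* the cosine similarity.
q = embed("What framework is the user working with?")
scores = index @ q
best = np.argsort(scores)[::-1][:1]

# Prepend the winner(s) to the system prompt as retrieved context.
context = "\n".join(chunks[i] for i in best)
print(context)
```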
TL;DR:

- don’t worry about fine-tuning, just use ada
- consider embedding summaries or synopses (see the sketch below), but it depends on what exactly you’re trying to achieve and how you’ll retrieve them (it helps if the retrieval text is similar in format to the stored text)
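If you go the summary route, something like this could work (a hedged sketch, again assuming the `openai` v1 Python client; the model choice and the summarization prompt are my own assumptions, not a fixed recipe):

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set

client = OpenAI()

def summarize(transcript: str) -> str:
    """Condense a raw conversation into a short synopsis, so the stored
    text is closer in format to the short queries you'll search with."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: any chat model will do here
        messages=[
            {"role": "system", "content": "In 2-3 sentences, summarize the "
             "key facts the user stated in this conversation."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

def embed_summary(transcript: str) -> list[float]:
    """Summarize first, then embed the synopsis with ada."""
    synopsis = summarize(transcript)
    resp = client.embeddings.create(
        model="text-embedding-ada-002", input=synopsis
    )
    return resp.data[0].embedding
```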
What is your end goal? I mean, what is your use case? You’re wanting to leverage information users have provided during prior conversations to make that knowledge “usable” for answering future questions?
In short, yes. I want to take the information users have provided during prior conversations and continually use that gathered context to improve future interactions with each user.
The tricky part is that embeddings give you no way to distinguish between a question about a topic vs. a statement of fact about that topic. And even if, theoretically, every statement ever made during a conversation were indeed a statement of fact, you have no way of knowing which are erroneous inputs from users and which are reliable, true statements. So it’s virtually impossible to “mine new information” out of prior conversations that didn’t originate in the LLM itself to begin with.
EDIT: However, you can use embedding search simply to “find” prior public conversations, and then perhaps offer end users a way to fully explore and view those entire conversations (like Stack Overflow does, so to speak).