I'm trying to figure out how to take past conversation data and either fine-tune my own embedding model on that data, use an existing embedding model (like OpenAI's ada or something), or train my own embedding model from scratch, so I can embed the data into a vector database that can be queried via semantic search to retrieve pieces of context for a system prompt.
To my knowledge most embedding models are designed for queries like Q&A, but I'm unsure how I would take unstructured past conversation data and embed it efficiently. Would I embed the entire conversation? Just question/answer pairs? Paragraphs? Really not sure what the best way to do this would be…
I've had a look around and I don't seem to be able to find many fine-tuned embedding models for this specific use case. I'm also conflicted about what kind of semantic search would be best here: dot product or cosine similarity? Or are both possible?
I'd be curious to see what others think!
I ripped this out of this presentation: https://d223302.github.io/AACL2022-Pretrain-Language-Model-Tutorial/lecture_material/AACL_2022_tutorial_PLMs.pdf
What an embedding model is trying to do is directly tap the internal semantic encoding of an LLM.
The idea is that similar concepts probably have similar vectors. The industry has settled on cosine similarity (which is just a normalized dot product), but you can use a plain dot product, Euclidean distance, or theoretically probably even substitution-only Levenshtein distance if you really wanted to.
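A quick numpy sketch of that relationship between the two metrics (the vectors here are made up just for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity = dot product of the L2-normalized vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, different magnitude

print(cosine_similarity(a, b))  # parallel vectors -> 1.0
print(np.dot(a, b))             # raw dot product also reflects magnitude: 28.0
```

Note that if your vectors are already normalized to unit length (many embedding APIs return them that way), dot product and cosine similarity give identical rankings.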
The whole point is that the model can understand the semantic meaning behind whatever syntactic structure you throw at it. With multimodal embeddings, for example, it wouldn't even matter whether you sent in a text description, an image, or both. If they describe or depict the same thing, their embedding vectors should be extremely close.
So what you do with the embeddings is up to you. They’re incredibly powerful, particularly with unstructured text.
RAG (most Q&A) is just one popular and easy to understand use of embeddings.
- don't worry about fine-tuning, just use ada
- consider embedding summaries or synopses, but it depends on what exactly you're trying to achieve and how you'll retrieve them (it helps if the retrieval text is similar in format to the stored text)
- just use cosine similarity.
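To make the chunking + retrieval flow concrete, here's a minimal sketch. The `embed` function here is a toy bag-of-words stand-in just so the example runs; in practice you'd swap in calls to a real embedding model (e.g. ada). The vocabulary, chunk texts, and query are all made up for illustration:

```python
import re
import numpy as np
from collections import Counter

# Toy stand-in for a real embedding model -- NOT a real embedding.
# It maps text to a normalized bag-of-words vector over a tiny vocabulary.
VOCAB = ["reset", "password", "billing", "invoice", "login", "email"]

def embed(text):
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    v = np.array([counts[w] for w in VOCAB], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# One chunk per question/answer pair, rather than the whole conversation
chunks = [
    "user: how do I reset my password? bot: use the reset link on the login page",
    "user: where is my invoice? bot: your invoice is under the billing tab",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(query, k=1):
    q = embed(query)
    # vectors are unit-normalized, so dot product == cosine similarity
    scored = sorted(index, key=lambda item: -float(np.dot(q, item[1])))
    return [c for c, _ in scored[:k]]

print(retrieve("password reset help"))
```

The same shape works with any vector database: store `(chunk, vector)` pairs, embed the query, rank by cosine similarity, and paste the top-k chunks into the system prompt.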
PS.: Rip Davinci. Why.
I have a dataset of conversation history that also has metadata in it; can ada handle that, or will I have to clean it up?
thanks for your super quick response btw!
it may work out of the box, but it may depend on your volume and how you’re gonna wanna retrieve it. HyDE is currently a popular approach.
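For reference, the shape of HyDE is simple: instead of embedding the raw query, you ask an LLM to draft a hypothetical answer first, embed that draft, and search with it, since the draft looks more like the stored documents than a bare question does. A sketch with placeholder functions (the LLM call is stubbed out; names are illustrative):

```python
def generate_hypothetical_answer(query):
    # Placeholder: in practice this would be an LLM call
    # (e.g. a chat completion prompted to answer the query outright).
    return f"Hypothetical answer to: {query}"

def hyde_search(query, embed, search):
    # Embed the hypothetical answer, not the query itself,
    # then run the usual vector search with that embedding.
    draft = generate_hypothetical_answer(query)
    return search(embed(draft))
```

`embed` and `search` are whatever embedding function and vector-store lookup you already have; only the extra generation step is new.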
does the metadata contribute anything to the semantic understanding of the content?
The metadata contains the timestamps of each user message and bot response, along with the date; that's about it.
What is your end goal? I mean, what is your use case? You want to leverage information users provided during prior conversations to make that knowledge "usable" for answering future questions?
In short, yes. I want to take information users provided during prior conversations and use that gathered context to continually improve interactions with each user.
The tricky part is that embeddings give you no way to distinguish between a question about a topic vs. a statement of fact about a topic. And even if, theoretically, every statement ever made during a conversation were indeed a statement of fact, you'd have no way of knowing which are erroneous inputs from users and which are reliable, true statements. So it's virtually impossible to "mine new information" out of any prior conversations that didn't originate in the LLM itself to begin with.
EDIT: However, you can use embedding search simply to "find" prior public conversations, and then perhaps offer end users a way to fully explore and view the entire prior conversation (like StackOverflow does, so to speak).
timestamp and date are probably not gonna do much for the embedding.
what do you want to get out of it?