Using embeddings for semantic search on transcripts

enknamel · March 20, 2023, 4:00pm

Hello,

I would like to do semantic search on audio transcripts. I’ve made a few proofs of concept using embeddings and have some questions.

If I embed the whole transcript (which often doesn’t fit) it seems wasteful. If I create embeddings with simple heuristics like sliding windows, it’s hard to optimize for relevancy given how freeform transcripts can be. Is there a method for identifying the optimal text to create an embedding for?
The transcript quality is fairly low, but there is separate metadata I can add to it, such as who is speaking, a summary of the conversation up to that point, topic labels, etc. Are there any best practices for cleaning text before creating an embedding?

wfhbrian · March 20, 2023, 4:12pm

I embed based on who is talking: each speaker turn gets its own embedding. Each embedding also includes metadata like the subject of the meeting. I also skip embedding if what the person said is less than a preset number of words.
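
A rough sketch of that per-speaker chunking, in case it helps. The `Turn` class, the `build_chunks` function, the `min_words` default of 5, and the "Subject/Speaker" text layout are all assumptions made up for illustration, not anything from a particular library. The strings it returns are what you would then pass to whatever embedding model you use.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One uninterrupted stretch of speech by a single speaker (hypothetical type)."""
    speaker: str
    text: str


def build_chunks(turns, subject, min_words=5):
    """Return one text chunk per speaker turn, skipping turns shorter than min_words."""
    chunks = []
    for turn in turns:
        # Very short turns ("Yes.", "Okay.") carry little meaning on their own.
        if len(turn.text.split()) < min_words:
            continue
        # Prepend the metadata so it becomes part of the embedded text.
        chunks.append(f"Subject: {subject}\nSpeaker: {turn.speaker}\n{turn.text}")
    return chunks


turns = [
    Turn("A", "Yes."),
    Turn("B", "We should move the launch to next quarter."),
]
chunks = build_chunks(turns, subject="Roadmap")
```

Here only B's turn survives the word-count filter. The threshold is something to tune against your own transcripts.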

Something I’ve been thinking about is including both the last thing someone else said, and the following thing, to give more context. But I haven’t tested that yet.
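
An untested idea of how the neighbouring-turn context could look, assuming turns are plain `(speaker, text)` tuples. The function name `build_context_chunks`, the "Previous/Next" labels, and the `min_words` default are hypothetical choices, not an established recipe.

```python
def build_context_chunks(turns, min_words=5):
    """Build one chunk per turn, wrapped with the previous and next turn for context.

    `turns` is a list of (speaker, text) tuples in transcript order.
    Turns shorter than min_words get no chunk of their own, but they can
    still appear as context around a neighbouring turn.
    """
    chunks = []
    for i, (speaker, text) in enumerate(turns):
        if len(text.split()) < min_words:
            continue
        parts = []
        if i > 0:
            prev_speaker, prev_text = turns[i - 1]
            parts.append(f"Previous ({prev_speaker}): {prev_text}")
        parts.append(f"{speaker}: {text}")
        if i + 1 < len(turns):
            next_speaker, next_text = turns[i + 1]
            parts.append(f"Next ({next_speaker}): {next_text}")
        chunks.append("\n".join(parts))
    return chunks


turns = [
    ("A", "How did the migration go last night?"),
    ("B", "It finished without errors and the latency dropped."),
    ("A", "Great."),
]
context_chunks = build_context_chunks(turns)
```

One trade-off to watch: the extra context makes each chunk longer and makes adjacent chunks overlap, so the same sentence can match a query from several chunks.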
