Using embeddings for semantic search on transcripts


I would like to do semantic search on audio transcripts. I’ve made a few proofs of concept using embeddings and have some questions.

  1. If I embed the whole transcript (which often doesn’t fit in the model’s context window anyway), it seems wasteful. If I create embeddings with simple heuristics like sliding windows, it’s hard to optimize for relevance given how freeform transcripts can be. Is there a method for identifying the optimal text to create an embedding for?

  2. The transcript quality is a bit low, but there is separate metadata I can add to it, such as who is speaking, a summary of the conversation up to that point, or topic labels. Are there any best practices for cleaning and enriching text before creating an embedding?
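For concreteness, the sliding-window heuristic mentioned in question 1 is roughly the following (a minimal sketch; the window and overlap sizes are arbitrary example values):

```python
def sliding_windows(text, window_words=200, overlap_words=50):
    """Split a transcript into overlapping fixed-size word windows.

    Each window shares `overlap_words` words with the previous one so
    that a sentence cut at a boundary still appears whole in a neighbor.
    """
    words = text.split()
    step = window_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_words]))
        if start + window_words >= len(words):
            break
    return chunks
```

The problem in practice is exactly what the question describes: the windows ignore speaker turns and topic shifts, so relevance is hit or miss.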

Currently I chunk by speaker turn, so I embed based on who is talking. Each embedding also includes metadata like the subject of the meeting. I also skip embedding a turn if what the person said is shorter than a preset number of words.
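A sketch of that turn-based approach (the function name, metadata fields, and word threshold are made up for illustration; the returned string is what would be passed to the embedding model):

```python
MIN_WORDS = 8  # arbitrary threshold; shorter turns are skipped

def turn_to_embedding_text(speaker, utterance, meeting_subject):
    """Build the text to embed for a single speaker turn.

    Metadata is prepended as plain text so it influences the embedding.
    Returns None when the turn is too short to be worth embedding.
    """
    if len(utterance.split()) < MIN_WORDS:
        return None
    return f"Meeting: {meeting_subject}\nSpeaker: {speaker}\n{utterance}"
```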

Something I’ve been thinking about is including both the last thing someone else said and the thing that follows, to give each embedding more context. But I haven’t tested that yet.
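That untested idea could look something like this (a sketch only; `turns` is assumed to be an ordered list of `(speaker, text)` tuples):

```python
def with_neighbor_context(turns, i):
    """Wrap turn i with the previous and next turns as extra context.

    turns: ordered list of (speaker, text) tuples; i: index of the
    target turn. Neighbors are tagged so the target stays identifiable.
    """
    parts = []
    if i > 0:
        parts.append(f"[prev] {turns[i - 1][0]}: {turns[i - 1][1]}")
    parts.append(f"{turns[i][0]}: {turns[i][1]}")
    if i + 1 < len(turns):
        parts.append(f"[next] {turns[i + 1][0]}: {turns[i + 1][1]}")
    return "\n".join(parts)
```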