Seeking Best Practices for Generating Accurate Embeddings from Video Transcriptions

I am currently exploring strategies to efficiently convert YouTube video transcriptions into high-quality embeddings. My goal is to improve the relevance and accuracy of search results, especially as the vector database scales. Below is a simplified representation of the transcript format:

<transcript><text start="1" dur="5">Transcript Text</text></transcript>

Current Approach

I am using the ‘text-embedding-ada-002’ model to generate embeddings, with cosine similarity as the distance metric. My current approach incorporates the video’s title, author, and partial transcript (250-word chunks) into the embedding content.

Content:
video.title by video.author - video.transcript (250 word chunks)

Metadata:
startTime, duration, videoId, channelId
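For clarity, each stored record is built roughly like this (a simplified Python sketch using the current OpenAI client; build_records, video, and chunks are placeholder names for my actual pipeline):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_records(video, chunks):
    """video: dict with title/author/videoId/channelId;
    chunks: dicts with text/startTime/duration from the 250-word splitter."""
    records = []
    for chunk in chunks:
        content = f"{video['title']} by {video['author']} - {chunk['text']}"
        embedding = client.embeddings.create(
            model="text-embedding-ada-002",
            input=content,
        ).data[0].embedding
        records.append({
            "embedding": embedding,
            "content": content,
            "metadata": {
                "startTime": chunk["startTime"],
                "duration": chunk["duration"],
                "videoId": video["videoId"],
                "channelId": video["channelId"],
            },
        })
    return records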

Issues Faced

While this method performs satisfactorily on a smaller dataset, the relevance of the search results has diminished as the vector database has grown. For instance, a query like “Did the cat eat the pizza?” returned less relevant results such as “Best foods for Cat by Lisa - the cat ate tuna…”, even though a more appropriate match exists, such as “The cat ate the pizza by Waldo - …”.

Queries

  1. Are there established best practices for creating embeddings from video transcriptions?
  2. What are some effective ways to optimize the content in the embeddings for better search relevance?
  3. Could the decline in relevance be due to the scaling of the database, or could it be an issue with the ‘text-embedding-ada-002’ model itself?

I welcome any insights or suggestions you may have to improve the performance and relevance of the embeddings.

The most likely cause for the perceived drop in relevance is that, with a larger database, the probability of hitting a cross-chunk-border semantic group is higher.

A cross-chunk semantic group is a unit of meaning that starts in one chunk and ends in another, where neither the start nor the end carries much relevance on its own. Example:

“My cat has a red ball” chunked into “My cat has” and “a red ball”: taken separately, those chunks have little to no meaning for a query about cats playing with red balls.

The solution is to overlap your chunks, that is, include roughly 25% of the prior chunk’s text at the start of the current chunk and 25% of the next chunk’s text at its end. By doing this you ensure that most of the semantic meaning gets stored in at least one chunk and is therefore logically searchable.
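A minimal sketch of that overlap, assuming a word-based window of 250 words and 25% overlap with each neighbouring chunk (the function name and defaults are just examples):

def overlapping_chunks(words, chunk_size=250, overlap_ratio=0.25):
    """Split a list of words into chunks that share ~25% of their text
    with the neighbouring chunks, so a sentence that straddles a chunk
    border still appears whole in at least one chunk."""
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Usage: overlapping_chunks(transcript_text.split())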


I’m currently using a window of 250 words along with the video’s title and author to generate the vector embeddings. A live demo of the application can be found at FishGPT.

A query such as “How to Damascus paint” should ideally return the specific video titled How to Damascus paint, which is in our vector database. However, it instead returns other videos related to lure painting.

While the current setup delivers reasonable results, I am eager to further improve the relevance of matches. Are there established practices for selecting an optimal chunk size for video-transcript embeddings? Is an embedding content format like video.title by video.author - video.transcript (250-word chunks) optimal? Or should the video title and description be summarized and included in the embedding content, since the transcript data might be too sparse?

If you require further refinement of retrievals, you can include a number of the top results in a prompt to a higher-order model such as GPT-3.5 or GPT-4, along with a request to use the returned chunks as context and to select the best ones according to the user’s request:

“###{chunk examples separated by ***'s}### given these vector database retrieved chunks, which are most relevant given a user request of @@@{user_request}@@@”
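A minimal sketch of that re-ranking call, assuming the current OpenAI Python client and a list of already-retrieved chunk strings (rerank_chunks is just an illustrative name):

from openai import OpenAI

client = OpenAI()

def rerank_chunks(chunks, user_request, model="gpt-3.5-turbo"):
    """Ask a chat model to pick the most relevant retrieved chunks."""
    joined = " *** ".join(chunks)
    prompt = (
        f"###{joined}### given these vector database retrieved chunks, "
        f"which are most relevant given a user request of @@@{user_request}@@@"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content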

But any topic that covers painting will look relevant to a request for topics that cover Damascus painting, unless you make a further API call asking the AI to translate the user’s request from generic to specific, and then use the AI’s response as the search prompt.
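A hedged sketch of that generic-to-specific rewrite (the prompt wording, function name, and model choice are only examples):

from openai import OpenAI

client = OpenAI()

def specialise_query(user_request, model="gpt-3.5-turbo"):
    """Rewrite a broad user request into a specific search query
    before it is embedded and sent to the vector database."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following request as a short, specific "
                f"search query, keeping any distinctive terms: {user_request}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

# e.g. the result of specialise_query("How to Damascus paint") becomes the
# text that is embedded and used as the search prompt.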