Am I on the right track with embeddings?

My goal is to figure out which of the top 20 trending news items per Google Trends relate to which of my 300 podcast episodes (if any).

My approach:

For each episode, take the text summary (roughly 100 words) and send it to the embeddings endpoint, which returns an embedding vector like this:

0.047140226,0.021655217,0.049956247,0.01724814,0.005744687,-0.023809474,0.011503453,0.0070470977,-0.018318228,0.04925224,0.003560509,-0.05153322,-0.030582009,0.020782249,0.03953696,0.015051642,0.03886112,-0.023429312,0.0030325048,0.036946222,0.024006596,0.02066961,-0.03308827,0.0054666046,0.007624382,0.0009187275,0.012376421,0.035172127,0.05415212,0.00096448784,0.03061017,-0.01100361,-0.09957457,0.050660253,-0.020528808,-0.0022228982,0.009201355,-0.018388629,0.025696209,0.01976848,-0.021894578,0.025104845,0.0098631205,0.048660878,0.010 [... truncated for brevity]
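
In case it helps anyone following along, the call itself can be as simple as this (a minimal sketch, assuming the OpenAI embeddings endpoint and PHP’s curl extension; $apiKey and the model name are placeholders for whatever you’re using):

// Minimal sketch: fetch the embedding vector for one piece of text.
// Assumes the OpenAI embeddings endpoint and PHP's curl extension;
// $apiKey and the model name are placeholders.
function get_embedding($text, $apiKey) {
    $payload = json_encode([
        'model' => 'text-embedding-3-small',
        'input' => $text,
    ]);

    $ch = curl_init('https://api.openai.com/v1/embeddings');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            'Authorization: Bearer ' . $apiKey,
        ],
        CURLOPT_POSTFIELDS     => $payload,
    ]);

    $response = json_decode(curl_exec($ch), true);
    curl_close($ch);

    // The vector is at data[0].embedding in the response body.
    return $response['data'][0]['embedding'];
}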

Store that vector array in my MySQL database as metadata against that episode.
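
The storage step is just serializing the array; something like this works (a sketch, assuming PDO and a hypothetical episode_meta table keyed by episode_id; adjust to your schema):

// Sketch: persist an episode's embedding as JSON text in MySQL.
// Assumes a PDO connection and a hypothetical `episode_meta` table
// where episode_id is the primary key; adjust to your schema.
function save_episode_embedding(PDO $db, $episodeId, array $vector) {
    $stmt = $db->prepare(
        'INSERT INTO episode_meta (episode_id, embedding)
         VALUES (:id, :embedding)
         ON DUPLICATE KEY UPDATE embedding = VALUES(embedding)'
    );
    $stmt->execute([
        ':id'        => $episodeId,
        ':embedding' => json_encode($vector), // json_decode() it when comparing
    ]);
}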

Do that for each episode until all 300 of them have a vector array for their episode summary.

For each news item, do the same thing, though I’m just caching those vectors in memory since they change frequently.

For each news item, loop through each episode and run the following PHP function on my web server (I got this from a different thread here on the forum):

// Element-wise multiply the two vectors, then sum the products.
// Assumes both arrays have the same length, which they will if both
// came from the same embedding model.
function dot_product($news_item_vector_array, $podcast_summary_vector_array) {
    $result = array_map(function($x, $y) {
        return $x * $y;
    }, $news_item_vector_array, $podcast_summary_vector_array);
    return array_sum($result);
}

Sort the results, and if the dot_product() for a given news item / episode pair is over some threshold, consider that a valid “News Item → Podcast Connection”.
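
For reference, the sort-and-threshold step can be as simple as the sketch below (find_connections() and the 0.3 cutoff are made up for illustration). If the embeddings come back normalized to unit length (OpenAI's embedding models generally do this), the dot product is effectively cosine similarity, which is why a fixed threshold behaves sensibly.

// Sketch: score every episode against one news item and keep anything
// above a threshold. dot_product() is the function above; the 0.3
// default is just an example value to tune.
function find_connections(array $newsVector, array $episodes, $threshold = 0.3) {
    $matches = [];
    foreach ($episodes as $episodeId => $episodeVector) {
        $score = dot_product($newsVector, $episodeVector);
        if ($score >= $threshold) {
            $matches[$episodeId] = $score;
        }
    }
    arsort($matches); // highest similarity first
    return $matches;
}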

I can’t believe this, but it seems to be kind of working?

Take this podcast summary for example:

Telling a clear story about your product is a basic entrepreneurial skill. But to build enduring impact, you need to help amplify other stories — those that surround you in your community and your customers.

Marcus Samuelsson has done just this with beloved restaurants such as Hav & Mar and Red Rooster, and through his media group that celebrates the richness of the world's cuisines and the stories embedded within them.

Marcus shares how embracing a diversity of stories has let him create spaces where every individual's narrative is valued, and has opened up new avenues of inspiration for him as an entrepreneur and award-winning chef.

Now, take this example news item, which I regard as a good candidate for a “News Item → Podcast Connection”:

The Red Rooster wins prestigious Michelin star award

Well, that gets a dot_product of 0.339879951911

Now, take this other news item, which I regard as completely unrelated to the podcast episode:

The country music singer Toby Keith has died

Well, that gets a much lower dot_product of 0.0421805131692

At the risk of sounding incredibly naive … is this embedding?


Yup :rofl: that’s a great use case for finding trending articles that relate to your podcasts as well.

It’s all perfectly fine as-is.

Going a bit further, I’d recommend moving to a vector database and then saving everything. You may be able to find some interesting, unrelated patterns between your podcasts and the trending news.


Maybe you can help me understand one more thing.

In gathering a vector array for each podcast episode, I have been sending the episode summary rather than the full transcript. The reason is that the transcript tends to be around 10k tokens, and the embeddings endpoint maxes out around 8k tokens.

What am I supposed to do here?

Do I break the episode up into chunks, make one API call for each chunk, and append them all together into one array to cover that episode?

A summary should be perfect for matching with the articles. But there are options for tinkering:

  1. Run the transcript through an LLM to condense it
  2. Use a different embedding model that supports the size you’re looking for

If your chunks can be considered encapsulated (sub-episodes, maybe?), I bet you could run a formula to “smart-chunk” them. Opinion, beware: I am completely against arbitrary chunking. “But mah overlap”, bah.

Then I think chunking is an option. Instead of trying to mix/mash the vectors together, you could prefix your chunks like this:

Podcast Title: “Title”
Podcast Blurb: “A couple sentences of the podcast”
“transcript”

This may help group the podcasts together.

You could also apply some metadata tags to the chunks, retrieve them by tag, and then calculate the centroid of their vectors.
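
A centroid here is just the element-wise mean of the chunk vectors; a rough sketch (the helper name is made up):

// Sketch: element-wise mean of a set of equal-length chunk vectors,
// producing one "centroid" vector to represent the whole group.
function vector_centroid(array $chunkVectors) {
    $count = count($chunkVectors);
    // Start with a zero vector of the right dimensionality.
    $sum = array_fill(0, count($chunkVectors[0]), 0.0);

    foreach ($chunkVectors as $vector) {
        foreach ($vector as $i => $value) {
            $sum[$i] += $value;
        }
    }

    // Divide each component by the number of chunks.
    return array_map(function ($v) use ($count) {
        return $v / $count;
    }, $sum);
}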

Tinker away!

Yep, currently tinkering. I am thinking that the nicely human-written summaries for each episode, which I already have, might actually be more valuable than sending the whole transcript, since the transcript might allow for false positives if the vectors happen to pick up on tangential topics rather than the main topic.


I’ve tried this same approach with MySQL. But let’s think for a moment: MySQL can’t compare vector similarity natively, so we have to pull the data back into the client app to do the math. In simplified steps: MySQL reads the disk, loads the data into memory, and sends it over the network (consider latency if the MySQL server is on a different machine than the client app); the app receives the data, loads it into memory, and then … does the math.

Considering the round trip above, I’m moving to Postgres to use pg_vector, which can do the vector comparison directly in the database and return only the relevant results.
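
For anyone curious, the query side ends up looking roughly like this (a sketch; the table and column names are made up, and <=> is pgvector’s cosine-distance operator):

// Sketch: let Postgres + pgvector return only the top matches.
// Assumes a hypothetical episode_embeddings table with a vector column;
// <=> is pgvector's cosine-distance operator (lower = more similar).
function top_matches(PDO $pg, array $newsVector, $limit = 10) {
    $sql = 'SELECT episode_id, embedding <=> CAST(:news AS vector) AS cosine_distance
            FROM episode_embeddings
            ORDER BY cosine_distance
            LIMIT ' . (int) $limit;

    $stmt = $pg->prepare($sql);
    // pgvector accepts a bracketed literal like "[0.1,0.2,...]"
    $stmt->execute([':news' => '[' . implode(',', $newsVector) . ']']);

    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}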


:+1:

Which I’ve been using for a while now. I can get the top 10 similarity results from tens of thousands of records in a split second on a basic cloud server.
