Fetching relevant news from a dump of data (newsletters, YouTube, RSS feeds ...)


I made a Python script that fetches all my newsletters (on a specific topic), favorite YouTube channels and RSS feeds.

Now I would like to scan through this dump and find the topics that are mentioned most often or that I have predefined as important (e.g. news on ChatGPT/OpenAI).
That way I do not need to read the same news over and over in different formats :slight_smile:

There is obviously a context limit with OpenAI, so how do I go through this data? I could import it into a temporary vector DB like Chroma (with HuggingFace embeddings to save some costs, I heard they are pretty OK) … but what are the next steps, how would you continue? Curious about your approach(es).

Thanks for your input!


I did similar things a couple of times for [redacted], and the process is relatively straightforward:

  1. Use an LLM to extract claims. Each news article asserts that some claims are true (facts).

  2. Temporally classify these claims. Some of them are new (‘news’), some of them are context recaps. The idea here is that you don’t want to accidentally contrast/collate old events/perspectives with new ones.

  3. Embed the claims and collate them. Use embedding retrieval to compare and contrast assertions: they can be identical, orthogonal, antithetical, or unrelated. If they fall into one of the first three categories, they belong together and should be presented together. Stash that polygram, and augment it with more data if new stuff emerges.

  4. You should then have a collection of distinct polygrams, and those you haven’t seen yet are your personal news.
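
To make step 3 a bit more concrete, here is a minimal sketch of collating claims by similarity. It makes several assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, `group_claims` and the `0.5` threshold are made up for illustration, and it only decides "similar enough to belong together" (it does not tell identical from antithetical claims, which would need an LLM or a classifier on top).

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_claims(claims, threshold=0.5):
    """Greedily attach each claim to the first group whose seed claim is
    similar enough; otherwise start a new group. Threshold is arbitrary."""
    groups = []
    for claim in claims:
        vec = embed(claim["text"])
        for group in groups:
            if cosine(vec, group["seed"]) >= threshold:
                group["claims"].append(claim)
                break
        else:
            groups.append({"seed": vec, "claims": [claim]})
    return groups

# Hypothetical claims, as an LLM extraction step might produce them.
claims = [
    {"text": "OpenAI released a new GPT model today", "source": "newsletter-a"},
    {"text": "A new GPT model was released by OpenAI", "source": "youtube-b"},
    {"text": "Chroma added a metadata filtering feature", "source": "rss-c"},
]
groups = group_claims(claims)
for group in groups:
    print([c["source"] for c in group["claims"]])
```

Each resulting group plays the role of one "polygram": the same story with every source that mentioned it. With a real embedding model the threshold would need tuning on your own data.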



Thank you for the effort, but the details on your methods and tools were unclear. While I appreciate the input, it wasn’t particularly useful for me.

I gather data from various sources, primarily email newsletters and YouTube channels. My focus is on acquiring recent content, eliminating the need for long-term storage. The main challenge lies in filtering out repetitive information to avoid revisiting the same content presented differently across newsletters or YouTube channels.

Currently, I collect data and store it in a JSON format, then process it with a large language model for assistance. However, this method seems inefficient, and I’m seeking a more streamlined solution. The challenge is getting through a large volume of data given the token limit, so I end up summarizing summaries to find the unique data, and thereby lose context.


this is just a starting point, but that’s what I meant :slight_smile:

+1 on vectorizing the data. I’ve never done it, but it seems like a good use case for this. I’d imagine it would be a different vector for each record, and then you’d want to query the newest vectors.

Thanks, I do know about embeddings; EmbedChain can do it for you if you don’t know how, btw.
Vectorizing is the same as embedding, so I’m not sure what you mean by that.
For me it’s not a question of how to do it technically; it’s a matter of embedding lots of data the right way, with optimal pre-processing, and then retrieving the info with the correct context. For example: give me all the relevant news you found recently on XAI in newsletters, YouTube and Twitter feeds, and it should return unique info with references to all the websites, YouTube videos and mailing lists, so I can cite them when I write a blog post.
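
One cheap pre-processing step that fits this goal is merging verbatim duplicates before anything gets embedded, while keeping every source as a reference. The sketch below is only an illustration under assumptions: the `text`/`source` field names and the `merge_exact_duplicates` helper are hypothetical, and it only catches copies that are identical after whitespace and case normalization (reworded versions still need the embedding comparison).

```python
import hashlib
import re

def normalize(text):
    """Collapse whitespace and lowercase so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()

def merge_exact_duplicates(items):
    """Keep one entry per normalized text and collect all sources that carried it."""
    merged = {}
    for item in items:
        key = hashlib.sha256(normalize(item["text"]).encode("utf-8")).hexdigest()
        entry = merged.setdefault(key, {"text": item["text"], "sources": []})
        if item["source"] not in entry["sources"]:
            entry["sources"].append(item["source"])
    return list(merged.values())

# Hypothetical items from the JSON dump.
items = [
    {"text": "XAI  tooling roundup for this week", "source": "newsletter-a"},
    {"text": "xai tooling roundup for this week ", "source": "mailing-list-b"},
    {"text": "New YouTube video on vector databases", "source": "youtube-c"},
]
unique_items = merge_exact_duplicates(items)
for entry in unique_items:
    print(entry["text"], "<-", entry["sources"])
```

The `sources` list is what you would later turn into references in the blog post.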

There’s a way to “tag” the vector records right? Maybe you can tag them with a date/time stamp, then define what “recent” means when you query the data? (Just spit-balling here.)
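
A rough sketch of that idea, kept to plain Python so it does not depend on any particular vector DB: each record carries a `published` timestamp as metadata, and "recent" is just a cutoff you choose at query time. The field names, the `recent_records` helper and the 7-day window are assumptions for illustration. Vector stores such as Chroma let you store metadata alongside each record, so the same cutoff could be applied as a metadata filter there.

```python
from datetime import datetime, timedelta, timezone

def recent_records(records, now, max_age=timedelta(days=7)):
    """Return only the records published within max_age of now."""
    cutoff = now - max_age
    return [r for r in records if r["published"] >= cutoff]

# Hypothetical records; in practice the timestamp would come from the
# newsletter date, the video upload date, or the RSS entry date.
records = [
    {"text": "XAI news from this week", "source": "newsletter-a",
     "published": datetime(2024, 1, 10, tzinfo=timezone.utc)},
    {"text": "XAI recap from last autumn", "source": "youtube-b",
     "published": datetime(2023, 11, 2, tzinfo=timezone.utc)},
    {"text": "Chroma release notes", "source": "rss-c",
     "published": datetime(2024, 1, 12, tzinfo=timezone.utc)},
]

now = datetime(2024, 1, 14, tzinfo=timezone.utc)
recent = recent_records(records, now)

# Stand-in for the similarity query: a plain keyword match on the recent subset.
hits = [r for r in recent if "xai" in r["text"].lower()]
print([r["source"] for r in hits])
```

Filtering by date first also keeps the number of vectors you have to compare small, which helps if the dump grows.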