I made a Python script that fetches all my newsletters (on a specific topic), favorite YouTube channels, and RSS feeds.
Now I would like to scan through this dump and find the topics that are mentioned most, or ones I predefine as important (e.g. news on ChatGPT/OpenAI).
That way I don't need to read the same news over and over in different formats.
There's obviously a context limit with OpenAI. How do I go through this data? Import it into a temporary vector DB like Chroma (with HuggingFace embeddings to save some costs; I've heard they're pretty OK)… but then what are the next steps, how would you continue? Curious about your approach(es).
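One way to get around the context limit before anything touches Chroma is to split each newsletter/transcript into overlapping chunks and embed those instead of whole documents. A minimal sketch — the function name, chunk size, and overlap are my own assumptions, not something from this thread:

```python
# Hypothetical chunker: split a newsletter dump into overlapping word
# windows before embedding. The overlap keeps sentences that straddle a
# boundary retrievable from both sides.
def chunk_text(text, size=200, overlap=40):
    """Split `text` into chunks of `size` words, overlapping by `overlap`."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + size])
        if chunk:
            chunks.append(chunk)
        if start + size >= len(words):
            break
    return chunks

# Each chunk would then go into Chroma, roughly like this (check the
# chromadb docs for the current API before relying on it):
#   collection.add(documents=chunks,
#                  ids=[f"doc-{i}" for i in range(len(chunks))])
```

With a SentenceTransformer embedding function attached to the collection, Chroma embeds the chunks for you at `add` time, so no OpenAI tokens are spent on ingestion.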
I did similar things a couple of times for [redacted], but the process is relatively straightforward:
Use an LLM to extract claims. Each news article asserts some claims to be true/facts.
Temporally classify these claims. Some of them are new ("news"), some of them are context recaps. The idea here is that you don't want to accidentally contrast/collate old events/perspectives with new ones.
Embed the claims and collate them. Use embedding retrieval to compare and contrast assertions: they can be identical, orthogonal, antithetical, or unrelated. If they're in one of the first three categories, they belong together and should be presented together. Stash that polygram, and augment it with more data as new stuff emerges.
You should then have a collection of distinct polygrams, and the ones you haven't seen yet are your personal news.
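The "embed and collate" step above can be sketched as a greedy grouping over claim embeddings. In a real pipeline an LLM extracts the claims and a HuggingFace model embeds them; here a toy bag-of-words vector stands in so the grouping logic itself is runnable (all names are mine, not the poster's):

```python
import math
from collections import Counter

def embed(claim):
    # Stand-in for a real sentence embedding (e.g. a SentenceTransformer).
    return Counter(claim.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def collate(claims, threshold=0.5):
    """Greedily group claims whose embeddings are similar enough.
    Each group is one 'polygram': related assertions presented together."""
    groups = []  # list of (representative_vector, [claims])
    for claim in claims:
        vec = embed(claim)
        for rep, members in groups:
            if cosine(rep, vec) >= threshold:
                members.append(claim)
                break
        else:
            groups.append((vec, [claim]))
    return [members for _, members in groups]
```

With real embeddings you would also keep the identical/orthogonal/antithetical distinction, e.g. by asking the LLM to label each near-neighbor pair rather than relying on the similarity score alone.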
Thank you for the effort, but the details on your methods and tools were unclear. While I appreciate the input, it wasn't particularly useful for me.
I gather data from various sources, primarily email newsletters and YouTube channels. My focus is on acquiring recent content, eliminating the need for long-term storage. The main challenge lies in filtering out repetitive information to avoid revisiting the same content presented differently across newsletters or YouTube channels.
Currently, I collect the data and store it in JSON format, then process it with a large language model for assistance. However, this method seems inefficient, and I'm seeking a more streamlined solution. The challenge is going through a large volume of data, given the token limit. So I need to summarize summaries to be able to find unique data, and thereby I lose context.
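The "summarize summaries" problem above is usually handled with a map-reduce pattern: pack items into batches that fit the context window, summarize each batch, then summarize the batch summaries. A rough sketch of the batching half — the actual LLM call is omitted, and all names and the token estimate are my own assumptions:

```python
def batch_by_tokens(items, max_tokens=3000):
    """Greedily pack items into batches whose rough token count fits the budget."""
    batches, current, used = [], [], 0
    for item in items:
        tokens = len(item.split())  # crude token estimate; a real tokenizer is better
        if current and used + tokens > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += tokens
    if current:
        batches.append(current)
    return batches

# Map step: summarize each batch with the LLM; reduce step: summarize
# the concatenated batch summaries. Carrying source URLs/IDs through both
# steps is what keeps the final summary referenceable.
```

Keeping the per-item source references attached to each intermediate summary is the part that mitigates the "losing context" problem.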
+1 on vectorizing the data. I've never done it, but this seems like a good use case for it. I'd imagine it would be a different vector for each record, and then you'd want to query the newest vectors.
Thanks, I do know about embeddings. EmbedChain can do it for you if you don't know how, btw.
Vectorizing is the same as embedding, so not sure what you mean by that.
It's not a question of how to do it technically for me; it's a matter of embedding lots of data the right way, with optimal preprocessing, then retrieving the info with the correct context. E.g. "give me all relevant news you found recently on XAI in newsletters, YouTube, and Twitter feeds," and it should return unique info with references to all the websites, YouTube videos, and mailing lists, so I can cite them when I write a blog post.
There's a way to "tag" the vector records, right? Maybe you can tag them with a date/time stamp, then define what "recent" means when you query the data? (Just spit-balling here.)
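That tagging idea maps directly onto vector-DB metadata. With Chroma you can store a numeric timestamp per record and filter at query time; below, a plain list of `(text, ts)` records shows the same "define recent at query time" logic in runnable form (names and the 7-day window are my assumptions):

```python
import time

# The Chroma version would look roughly like this (check the chromadb
# docs for the current query API before relying on it):
#   collection.query(query_texts=["XAI news"],
#                    where={"ts": {"$gte": cutoff}})
def recent(records, days=7, now=None):
    """Keep records whose timestamp falls within the last `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    return [text for text, ts in records if ts >= cutoff]
```

Storing the source URL alongside the timestamp in the same metadata dict would also cover the earlier requirement of citing newsletters/videos in a blog post.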