Hi, if this isn’t the right place to talk about this sort of thing, could you help me find that place? Thanks.
I’m working with what is (IMO) a large set of message board data. I’ve got my ETL pipeline to the point where I can really start paying attention to inference, and so obviously I have a ton of questions. The hardest one to wrap my head around is the granularity and length my summaries should be. I’ve realized that artful summarization is very important for accessing a large amount of semantic data quickly (am I wrong here? What are the practical limits to just embedding every message and searching all of it? It’s ~1 TB of HTML, but based on my processing so far I’m guessing ~100 GB of pure prose).
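For a rough sense of scale, here’s the back-of-envelope math I’d start from. The assumptions are mine, not facts about your corpus: ~1 KB per message after cleaning, one embedding per message, 1536-dim float32 vectors:

```python
# Back-of-envelope: what does "embed every message" cost in index size?
# Assumptions (all mine): ~1 KB average message, 1536-dim float32 vectors.

def embedding_footprint_gb(corpus_bytes: float,
                           avg_chunk_bytes: int = 1_000,
                           dims: int = 1536,
                           bytes_per_float: int = 4) -> tuple[int, float]:
    """Return (num_vectors, raw_vector_storage_gb) for one vector per chunk."""
    n = int(corpus_bytes // avg_chunk_bytes)
    gb = n * dims * bytes_per_float / 1e9
    return n, gb

n, gb = embedding_footprint_gb(100e9)  # ~100 GB of prose
print(n, round(gb, 1))  # ~100M vectors, ~614 GB of raw float32 vectors
```

So a flat index of everything is bigger than the corpus itself, which is why people reach for quantization, smaller embedding dims, or summarize-then-embed. The numbers move a lot with your real average message size, so it’s worth measuring that first.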
But how do I decide exactly what to summarize? I’m processing data in reverse order of Recommendations (user curation), and summarizing the entire thread each recommended message is part of (to correctly tag and attribute quotes). I’ve done maybe the top 1,000 threads, but the summaries just don’t seem all that helpful. They’re mostly generic “There was lively discussion about the topic” sorts of things. Should I be doing something more structured?
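One version of “more structured” would be forcing each thread summary into a fixed schema instead of free prose, so the model has to commit to specifics. This is just a sketch and the fields are my guesses at what matters for a finance board, not anything canonical:

```python
from dataclasses import dataclass, field

@dataclass
class ThreadSummary:
    """Hypothetical schema - the field names are illustrative only."""
    thread_id: str
    topic: str                                         # one noun phrase, not a sentence
    claims: list[str] = field(default_factory=list)    # concrete assertions people made
    tickers: list[str] = field(default_factory=list)   # symbols actually discussed
    consensus: str = "n/a"                             # e.g. "bullish" / "bearish" / "split"
    open_questions: list[str] = field(default_factory=list)

# Example of what a filled-in record might look like:
s = ThreadSummary(thread_id="t-001",
                  topic="AMZN 1999 Q4 earnings",
                  claims=["revenue growth will slow"],
                  tickers=["AMZN"],
                  consensus="split")
```

A schema like this makes “lively discussion about the topic” an invalid output by construction, and the structured fields are also things you can filter and count later.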
I haven’t gotten past threads yet, but I’m thinking of also summarizing at the board level, and perhaps quarterly (to measure change over time); authors could work the same way. Quarterly summaries could roll up into yearly ones, which would inform the author- or board-level summary.
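The rollup idea above could be sketched as a grouping step: bucket thread summaries by (board, year, quarter), then feed each bucket to the model for the next level up. Assuming ISO-formatted dates, something like:

```python
from collections import defaultdict

def rollup(summaries):
    """summaries: list of (board, date 'YYYY-MM-DD', summary_text).
    Returns {board: {year: {quarter: [texts]}}} - the raw material for
    each level of rollup (quarter -> year -> board)."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for board, date, text in summaries:
        year, month = int(date[:4]), int(date[5:7])
        quarter = (month - 1) // 3 + 1
        tree[board][year][quarter].append(text)
    return tree

t = rollup([("stocks", "1999-02-01", "s1"),
            ("stocks", "1999-05-09", "s2")])
```

Then each quarterly summary is generated from its bucket, yearly summaries from the four quarterly ones, and so on up, so no single summarization call ever sees more than one level’s worth of input.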
Also, how big do I make them? Where can I find discussion of the balance between data size and semantic meaning? If I summarize a 500-rec, 5-page essay in two sentences, I’m losing a lot more information than if I summarize a two-paragraph message with no recs in two sentences. What metrics should I use to dynamically determine the proper length of a summary?
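One heuristic along these lines, entirely made up but tunable: let the summary budget grow with the log of the source length, weighted by a salience factor from the rec count, with a floor and a cap:

```python
import math

def summary_budget(source_words: int, recs: int,
                   base: int = 40, floor: int = 20, cap: int = 400) -> int:
    """Hypothetical heuristic: target summary length in words.
    Grows with log of source length, scaled by rec-based salience."""
    salience = 1 + math.log1p(recs) / 2          # recs=0 -> 1.0, recs=500 -> ~4.1
    budget = base * math.log2(max(source_words, 2)) * salience / 8
    return min(cap, max(floor, round(budget)))

# A 500-rec, ~2,500-word essay gets a much larger budget than
# a 100-word, zero-rec message, but both are bounded.
```

The constants here are arbitrary; the point is just that length and curation signal can both feed the budget, and the log keeps a 10x longer source from demanding a 10x longer summary.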
Or if I am just thinking about this all wrong, feel free to say that, too. Thanks.
I am running into the same issue: thread-level summaries end up vague and not very useful. Right now it feels like I am summarizing activity instead of information. I am starting to think structure and intent matter more than raw length.
Hey, it’s nice just to know someone else is working on it, and also having problems!
What are you using to visualize the data? I have the beginnings of a document clustering system, which in my mind I can use to refine the summarization process, which will in turn make the clustering work better in a virtuous circle. Not quite there yet. I’m also interested in visualizing a knowledge or social graph but haven’t started on that yet.
My general plan has been to come up with an easily repeatable ETL that lets me test several different tactics on a well-known subset of the data. I think I have enough processed data now to tell me what I need to change; it’s just hard to imagine a machine doing it well. If you want something done right…
Well, it’s kind of just a pet project with very nebulous goals. It’s message board data from a financial website, roughly 1997 to 2010, and my stretch goal is to develop heuristics, using the secrets inside, to generate a market buy/sell signal that can be used in real time on current data (some other message board, I guess). But that’s just the dream…
I have also just learned the concept of “semantic condensation,” and maybe that’s what I should be doing instead of (or in addition to) a prose summary. Compared to the ETL process, which I’m pretty familiar with, this side of things involves a lot of design decisions I have no experience or business making!
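Roughly what I have in mind: instead of prose, each message collapses to a structured stance record. The shape below is purely my guess at what would matter for this corpus, nothing standard:

```python
def condense(author: str, ticker: str, direction: str, evidence: list[str]) -> dict:
    """Hypothetical 'semantic condensation' of one message: a stance
    record you can filter, count, and aggregate, which a prose
    summary doesn't easily support."""
    return {
        "author": author,
        "stance": {"ticker": ticker, "direction": direction},
        "evidence": evidence,
    }

rec = condense("user123", "YHOO", "bearish", ["declining ad revenue"])
```

The appeal is that aggregates (how many bearish stances on YHOO this quarter?) fall out of simple queries over these records, whereas the prose summaries I have now can’t answer that at all.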
Wall Street banks, hedge funds, and others are already doing something like this, in real time. I know because I’m a day trader in my spare time. However, none of these guys are going to pay attention to what someone says on Yahoo Finance. Why? Because of the potential for market manipulation. They rely on verified data, and rightly so.
I don’t want to discourage you, but something for you to think about.