Friday, June 28, 2013

PrefixSpan to the rescue!

Just when I had buckled down to concoct some crazy algorithm from my own head.

Just when I had written off my internship experience in Kyoto as wholly irrelevant to my life beyond that summer of 2012.

As it turns out, I have already spent extensive time "combining entities" (as I clumsily put it on my Trello board), only back then it was defined in much more elegant and precise terms.

My Problem Space

So since last night, my Python code 1) gathers the trending topics of each major news section on the Google News aggregator, 2) using these topics as queries, executes a search via the GN RSS feed for each query, 3) parses the RSS feed of each topic for all associated article titles, 4) uses named entity recognition (NER) to extract significant words in the article titles, and 5) computes the occurrence frequency of each unique named entity over a single topic.

The problem was that there was a lot of redundancy in each entity list. For example, my sample topic for testing was "Kim Kardashian," and (in honor of Kim and Kanye West's new baby) entities included:
- "North West" (6)
- "North" (3)
- "Baby North West" (2)
- "Baby North" (1)
- "North West Updates" (1)
- "Might Introduce Baby North West" (1)
- "Released Fake Pictures of Baby North West" (1)
- "Kanye West Explains Meaning Behind North" (1)

Wouldn't it be nice if we could get a more pronounced, less scattered sense of the buzz surrounding this new baby girl?

Enter: PrefixSpan

"PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth"
by Jian Pei, et al. (2001)

I honestly can't remember much about how the PrefixSpan algorithm works, and so I plan on revisiting this paper tonight. But let it be known: this is not a trivial algorithm - to think about or to implement.

Seriously, who knows how much time I might have spent (wasted?) rolling my own algorithm, or searching Google with all the wrong keywords, had I not recognized that I've seen this problem before.

Anyway, the tl;dr for the algorithm is as follows:
... [TODO] ...

Here's a diagram I created for my project's final presentation that represents the fundamental structure of the code:















And here is an illustrated example of how the algorithm works (courtesy of a slide from a previous intern working on the project I temporarily took over):

















With a naive, out-of-the-box implementation of PrefixSpan that I found online, the example problem above is condensed to:
- "West" (24) *
- "North" (16)
- "North West" (11)
- "Baby" (6) *
- "Baby North" (5)
- "Baby North West" (4)
- "Baby West" (4) **
* includes other occurrences not listed above
** an unexpected result [UPDATE 7/2/13: modified algorithm to prevent this]

Yes, this list is nearly as long - but it is clearly more focused, and the counts provide sharper demarcations.

Lesson Learned

The value of gathering exposure to all kinds of algorithms.

No comments:

Post a Comment