Friday, June 28, 2013

PrefixSpan to the rescue!

Just when I had buckled down to concoct some crazy algorithm from my own head.

Just when I had written off my internship experience in Kyoto as wholly irrelevant to my life beyond that summer of 2012.

As it turns out, I have already spent extensive time "combining entities" (as I clumsily put it on my Trello board), only back then it was defined in much more elegant and precise terms.

My Problem Space

As of last night, my Python code 1) gathers the trending topics of each major news section on the Google News aggregator, 2) executes a search via the GN RSS feed using each of these topics as a query, 3) parses the RSS feed of each topic for all associated article titles, 4) uses named entity recognition (NER) to extract significant words from the article titles, and 5) computes the occurrence frequency of each unique named entity over a single topic.
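Sketched in code, steps 3–5 look roughly like this. The RSS snippet is a made-up stand-in for an actual GN feed response, and the regex "NER" is a crude placeholder for the real NER step, not what my code actually uses:

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample standing in for a Google News RSS response for one topic.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>Kanye West Explains Meaning Behind North</title></item>
<item><title>Released Fake Pictures of Baby North West</title></item>
<item><title>Baby North West Updates</title></item>
</channel></rss>"""

def titles_from_rss(xml_text):
    """Step 3: pull every article title out of an RSS feed."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.iter("item")]

def naive_entities(title):
    """Step 4 stand-in: grab maximal runs of capitalized words.
    (A real NER library does this properly; the regex is only illustrative.)"""
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", title)

def entity_counts(titles):
    """Step 5: occurrence frequency of each unique entity across a topic."""
    counts = Counter()
    for title in titles:
        counts.update(naive_entities(title))
    return counts

counts = entity_counts(titles_from_rss(SAMPLE_RSS))
```

Running this on the toy feed yields exactly the kind of fragmented entity list described below: "Baby North West" and "Baby North West Updates" are tallied as distinct entities.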

The problem was that there was a lot of redundancy in each entity list. For example, my sample topic for testing was "Kim Kardashian," and (in honor of Kim and Kanye West's new baby) entities included:
- "North West" (6)
- "North" (3)
- "Baby North West" (2)
- "Baby North" (1)
- "North West Updates" (1)
- "Might Introduce Baby North West" (1)
- "Released Fake Pictures of Baby North West" (1)
- "Kanye West Explains Meaning Behind North" (1)

Wouldn't it be nice if we could get a more pronounced, less scattered sense of the buzz surrounding this new baby girl?

Enter: PrefixSpan

"PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth"
by Jian Pei, et al. (2001)

I honestly can't remember much about how the PrefixSpan algorithm works, and so I plan on revisiting this paper tonight. But let it be known: this is not a trivial algorithm - to think about or to implement.

Seriously, who knows how much time I might have spent (wasted?) rolling my own algorithm, or searching Google with all the wrong keywords, had I not recognized that I'd seen this problem before.

Anyway, the tl;dr for the algorithm is as follows:
... [TODO] ...
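In the meantime, here is my own naive rendering of the paper's core idea in Python: recursively grow a frequent prefix and project the database onto the suffixes that follow it. The toy database and the minsup threshold are illustrative choices of mine, not values from my project:

```python
from collections import Counter

def prefixspan(db, minsup):
    """Mine all sequential patterns with support >= minsup from db,
    a list of sequences (here: lists of title tokens).
    Returns a list of (pattern, support) pairs."""
    results = []

    def grow(projected, prefix):
        # Count, for each item, how many projected sequences contain it.
        support = Counter()
        for seq in projected:
            support.update(set(seq))
        for item, sup in support.items():
            if sup < minsup:
                continue
            pattern = prefix + [item]
            results.append((pattern, sup))
            # Project on `item`: keep the suffix after its first
            # occurrence in each sequence that contains it.
            suffixes = [seq[seq.index(item) + 1:] for seq in projected if item in seq]
            suffixes = [s for s in suffixes if s]
            if suffixes:
                grow(suffixes, pattern)

    grow(db, [])
    return results

# Toy database: tokenized headline fragments (illustrative, not my real data).
db = [
    ["Baby", "North", "West"],
    ["North", "West"],
    ["Kanye", "West"],
]
patterns = {tuple(p): s for p, s in prefixspan(db, minsup=2)}
```

The projection step is what makes PrefixSpan efficient in the paper: instead of generating candidate patterns and re-scanning the whole database, each recursive call only scans the (shrinking) projected suffixes.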

Here's a diagram I created for my project's final presentation that represents the fundamental structure of the code:

[diagram not included]
And here is an illustrated example of how the algorithm works (courtesy of a slide from a previous intern working on the project I temporarily took over):

[slide not included]
With a naive, out-of-the-box implementation of PrefixSpan that I found online, the example problem above is condensed to:
- "West" (24) *
- "North" (16)
- "North West" (11)
- "Baby" (6) *
- "Baby North" (5)
- "Baby North West" (4)
- "Baby West" (4) **
* includes other occurrences not listed above
** an unexpected result [UPDATE 7/2/13: modified algorithm to prevent this]

Yes, this list is nearly as long - but it is clearly more focused, and the counts provide sharper demarcations.
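The starred "Baby West" result, by the way, falls directly out of how sequential pattern mining defines a match: PrefixSpan counts subsequences, not contiguous substrings, so a title containing "Baby North West" supports the pattern "Baby West". A toy check (my own illustration, not the implementation I found online):

```python
def supports(seq, pattern):
    """True if `pattern` occurs in `seq` as a (not necessarily
    contiguous) subsequence, preserving order."""
    it = iter(seq)
    return all(item in it for item in pattern)  # `in` consumes the iterator

# Illustrative tokenized titles.
titles = [
    ["Baby", "North", "West"],
    ["Baby", "North"],
    ["North", "West"],
]
support = sum(supports(t, ["Baby", "West"]) for t in titles)
```

Only the first title supports "Baby West", even though the words are never adjacent in it; that gap-tolerance is exactly what my 7/2 modification had to rein in.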

Lesson Learned

The value of gathering exposure to all kinds of algorithms.

Thursday, June 27, 2013

I am a robot

...sigh. How does newsmap do it...? Hmm. I don't think it's my use of the RSS feed but rather the rate of my use of the RSS feed today. I was doing a lot of iterating and testing.

UPDATE: yay, the block was released by morning! so...

Lesson Learned

Impose a rate limit on myself before a rate limit is imposed on me.
Iterate and test on local copies.
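The first lesson can be sketched as a tiny self-throttling helper. The class name and interval are my own inventions, purely illustrative:

```python
import time

class RateLimiter:
    """Self-imposed rate limit: block until at least `min_interval`
    seconds have passed since the previous call."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

# Usage: call limiter.wait() before every feed fetch, e.g.
#   limiter = RateLimiter(2.0)
#   for topic in topics:
#       limiter.wait()
#       fetch_rss(topic)  # hypothetical fetch function
```

Pairing this with the second lesson (test against a locally saved copy of the feed) means the live feed only gets hit when the parsing code is already known to work.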

Trello

I grew fond of Trello for organizing workload and assigning responsibility within my five-person team during CS 210, and so I have returned to it for my current project, despite being a team of one this time around.

New board started today!


Wednesday, June 26, 2013

The first pivot

On May 21, a random Tuesday*, I discovered newsmap, an application created in 2004 that sought a way to map out the news landscape via the Google News aggregator.

It was pretty demoralizing. It even used the term "news landscape" like I did.

Fortunately, my morale was lowered for only about a minute or two (literally, 1-2 minutes. I was actually surprised and proud of how quickly I bounced back. But these self-respect issues are the topic of another medium...).

I realized that I was completely unaware of the comprehensiveness and power of the GN aggregator. The fact that I was planning to aggregate news articles on my own via multiple, independent RSS feeds became laughable - the work was already done for me, by Google, on a level that far surpassed anything my solo development could have produced.

Now that I was more aware of GN, I grew excited about the possibilities for me to explore this treasure trove of news data. Even if I was beat to the punch on creating this news landscape I envisioned (beat by nine years at that), I could still mess around with the data as a neat learning experience.

Furthermore, I found that I had several unexplored ideas (e.g., infinite tweetflight!) that I could still pursue over my summer break. So maybe TechM was concluding only 10 days after the birth of this blog, before even one line of code was written, but I felt that I could easily re-purpose this blog to document my other ideas. This blog would be an authentic record of the life of a creative, curious coder through both the successes and the failures.

...and the pivots. Fast-forward to today. I am now on summer break after graduating with my bachelor's degree on June 16. I followed through and started playing around with GN this past week, and I don't think TechM is dead anymore.

* How can I possibly remember that particular date from a month ago? Kudos to the "revision history" feature in Google Docs, which remembered when I placed newsmap in my project notes :)


The Pivot

First, let me write about newsmap a little more.

The application is certainly eye-catching and very cool. I appreciate the bold colors and bold fonts and its attempt at directing attention to the most important stories of the moment.

But once you actually start trying to consume the information, newsmap is overwhelming. There is too much data on the screen at once, the use of primary colors strains the eyes, and it is not obvious what to look at (and even if you do find some temporary focus, it is quickly drawn away by everything else going on on the page).

I think I can do this better. I think I can find a simpler, more intuitive design for displaying a news landscape.

First off, my design will not have as much text on the screen at once. Instead of displaying full article titles, my plan has always been to display key entities - persons, organizations, products, events, etc. More information on these entities will be a click away.

When I investigated the Google News site more closely, I noticed that Google already has a "trending topics" feature for each major news section. It is just a simple list of topics, but it exists. So I grew discouraged momentarily, thinking that the aggregator already had my idea covered. However, the topics list is just an afterthought for GN. I think there is more potential to be found here. I wish to build something that holds trending topics at its core, at its foundation. At the same time, once again, at some level the work has already been done; at this point in the project, I don't need to roll my own trending topics list.

So here's the pivot: TechM will employ GN's trending topics listing as a baseline. Then, it will go a level deeper and build an index, as first planned, but around the entities associated with each trending topic provided by GN. I like this idea because I trust the data GN delivers, and so directly using what it has deemed a trending topic (as opposed to using what I might have deemed a trending topic) provides a very solid information baseline for my product. Then with this established, I begin the work of replicating this trending topics implementation for the "subsections" (i.e., topics) provided by GN. Oh, and I'm also not limited to building this just for tech blogs (so TechM makes less sense... but it's okay because it's just a project codename).

The pivot could be considered minor to an outsider, but it was instrumental to me and my ability to get past the idea that I was merely being derivative. I feel excited about this project once again.

Now I have some Python code down, things are looking promising, and all is well. I hope to write a little bit about the coding side of things on this blog as well.

Lessons Learned

1) Make full use of the tools, packages, and data forms already available to me. Don't bother reinventing the wheel, unless the current wheel is significantly restricting my progress.

2) Just because something has been done before does not mean I can't do it better. Especially if "it" was last iterated on in 2005, and the brains behind the operation have moved on to bigger and better things like Flipboard (an app I love, by the way).

3) Try to remember that I am (at present) coding a prototype, a minimum viable product... so don't get too carried away with software architecture.