Friday, July 19, 2013

v0.1

[Screenshot: first version of TechM, 'Technology' section tab]

Here is a screenshot of the first version of TechM (shown on the 'Technology' section tab). The client has been implemented in Ruby on Rails and HTML/CSS, with intentional avoidance of JS for now (ending my philosophical tech debate temporarily). A few things you can't tell from the screenshot:
  • An asterisk (*) next to an entity indicates that the entity represents an entire cluster of similar entities. Hovering over the entity will display the rest of the cluster in a tooltip.
  • Hovering over any entity will also display related article titles in the same tooltip. (This 'feature' is only for creative purposes: going forward, it lets me see how important article titles are in the context of each entity and trending topic. I have no intention of showing article titles in a tooltip in a released version of the product.)
The second bullet point above really hints at the major advantage of this UI: allowing me to visualize the data in a more sophisticated way than I could previously. This is a nice step up from staring at the data in JSON format.

This version is not good enough for release for several reasons:
  • Obviously, the color scheme is fairly horrendous. I have yet to settle on a good one.
  • Many of these entities aren't useful without more context. Seeing a pile of entities related to each trending topic doesn't inspire me to continue clicking around and exploring the trending topic. This is a huge problem that I clearly need to fix.
Here are some thoughts for improvement:
  • Assign a weight to each cluster based on the number of entities in it and their frequencies, then show only information related to the weightiest cluster (a combination of entities and related articles)
  • Only collect the named entities that occur *directly after* the trending topic in an article title... ('meh' on this idea)
  • Use a POS tagger so that, after showing the important entities, I can show just verb phrases instead of entire article titles
I'm also going to start using git branches to explore these options for displaying the data.
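
To make the first idea above a bit more concrete, here is a minimal Python sketch of cluster weighting. Everything in it is an assumption for illustration: the class fields, the function names, and especially the weight formula (entity count plus summed frequencies), which is just one simple way to combine the two signals and not something I've settled on.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    frequency: int  # times the entity appeared in article titles for a ttopic
    article_titles: list = field(default_factory=list)


@dataclass
class Cluster:
    entities: list

    def weight(self):
        # Assumed formula: number of entities plus their summed frequencies.
        return len(self.entities) + sum(e.frequency for e in self.entities)


def weightiest_cluster(clusters):
    """Return the cluster with the largest weight, or None if there are none."""
    return max(clusters, key=lambda c: c.weight(), default=None)


def summarize(cluster):
    """Collect entity names and de-duplicated article titles for display."""
    names = [e.name for e in cluster.entities]
    titles = []
    for e in cluster.entities:
        for t in e.article_titles:
            if t not in titles:
                titles.append(t)
    return names, titles
```

With this, the client would call weightiest_cluster() once per trending topic and render only what summarize() returns for it, instead of every entity.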

Overall though, it's exciting to have a working prototype of some sort, even though it is very rough and needs more work!

Monday, July 15, 2013

Structure of the data layer

[Diagram: tree structure of the data layer, with levels Section, Ttopic, Cluster, Entity, Article]

Section, Ttopic, Cluster, Entity, Article - these terms are mirrored as classes in my Python code. The image above is a visual representation of the structure of my code at this time.

As one can see, the structure is tree-like, with two exceptions noted along the right-hand side - trending topics (ttopics) and articles should probably not be restricted to belonging to a single section and a single entity, respectively. I am still working on implementing this behavior, and it is not represented in the image above, but it would mean that leaf nodes on the ttopic and article levels would branch upwards as well.

Below is a quick description of the utility of each class:

Section
pulled directly from Google News, these are the available news sections: Top Stories, World, U.S., Business, Technology, Entertainment, Sports, Health, Science.

Trending topic (Ttopic)
also pulled directly from Google News but dynamically changing, these are the most popular news topics for each section at any given time (similar to Twitter's trending topics which show keywords 'tweeters' are using more or growing fastest)

Article
news articles pulled directly from the Google News RSS feed for each ttopic. instead of associating articles with a ttopic directly, they are first associated with the entities (discussed below) that were extracted from them (this is why a single article can be repeated across several entities).

Entity
short for the concept of a 'named entity' from the information extraction world. words from article titles categorized as the names of persons, organizations, and locations with the help of Stanford NER from The Stanford NLP Group (there are other predefined categories that are not currently used in this project). once identified, entities are tallied by frequency per ttopic using PrefixSpan.

Cluster
a wrapper for entities. groups similar entities for a ttopic to reduce redundancy in the information displayed on the future client.
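
For reference, the hierarchy described above can be sketched in Python roughly like this. This is a simplified illustration, not my actual code: the field names, the use of dataclasses, and the articles() helper are all assumptions. The category values follow the person/organization/location labels mentioned above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    title: str
    url: str


@dataclass
class Entity:
    name: str
    category: str  # e.g. "PERSON", "ORGANIZATION", "LOCATION"
    frequency: int = 0
    articles: List[Article] = field(default_factory=list)


@dataclass
class Cluster:
    entities: List[Entity] = field(default_factory=list)


@dataclass
class Ttopic:
    name: str
    clusters: List[Cluster] = field(default_factory=list)

    def articles(self):
        """All distinct articles under this ttopic, in first-seen order."""
        seen, result = set(), []
        for cluster in self.clusters:
            for entity in cluster.entities:
                for article in entity.articles:
                    # Hypothetical choice: treat the URL as the article's identity.
                    if article.url not in seen:
                        seen.add(article.url)
                        result.append(article)
        return result


@dataclass
class Section:
    name: str
    ttopics: List[Ttopic] = field(default_factory=list)
```

Because each Entity holds references to Article objects rather than copies, the same Article instance can sit under several entities, which is the "branching upwards" behavior at the article level. The articles() helper de-duplicates them when a whole ttopic is displayed.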

Tuesday, July 2, 2013

Technology soup

From Medium:
In comes node.js, it seemed cool and would achieve what I wanted so I decided to start again...
I decided to have a client-side app and render the views in the browser as it was the “in thing” to do. I used all the goodies available to me such as Backbone, jQuery, Bootstrap and component. This tied to a node server running express.js and mongodb. To top it all off I wanted to be cool so I wrote the whole darn thing in CoffeeScript.
Now that TechM's data layer is solid enough for prototyping purposes (ahem, sans testing...), I'm trying to figure out how the heck I should connect to and develop for the client application. My mind, my Google searches, and my Trello board are all starting to reflect the chaos quoted above, scarily enough. I'm kinda getting lost in the sauce.

The thing is, I want to strike a balance. I want to learn new web technologies this summer, but I also want a working prototype by the end of the week... and between CouchDB + Node.js + Express.js + Mustache + jQuery + Flat UI Kit + ???, none of which I really know, it's looking pretty bad.

What I do know is Ruby on Rails... but I've already developed extensively with it. I also think it's too heavy for what I want to accomplish. But I'm missing the comprehensiveness of Rails, I think. With node.js, every component is customizable and needs to be installed individually via npm, and I haven't yet found a recommended, standard configuration of tools. I haven't found any ridiculously thorough tutorials for node.js yet either (like Michael Hartl's exceptional Rails tutorial).