Monday, July 15, 2013

Structure of the data layer
















Section, Ttopic, Cluster, Entity, Article - these terms are mirrored as classes in my Python code. The image above is a visual representation of the structure of my code at this time.

As one can see, the structure is tree-like, with two exceptions noted along the right-hand side - trending topics (ttopics) and articles should probably not be restricted to belonging to a single section and a single entity, respectively. I am still working on implementing this behavior, and it is not represented in the image above, but it would mean that nodes on the ttopic and article levels would branch upwards as well, each with potentially more than one parent.
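To make the shape concrete, here is a minimal sketch of how these classes might look. It is not my actual code: the field names and the helper methods `add_ttopic` and `add_article` are assumptions made up for illustration, and the sample names are placeholders. The point is only to show the tree plus the two "branch upwards" exceptions, where a ttopic can sit under several sections and an article under several entities.

```python
from dataclasses import dataclass, field


# eq=False keeps identity-based comparison, which avoids recursive
# equality checks between objects that reference each other.
@dataclass(eq=False)
class Article:
    title: str
    entities: list = field(default_factory=list)  # parents: may be more than one


@dataclass(eq=False)
class Entity:
    name: str
    category: str  # e.g. "PERSON", "ORGANIZATION", "LOCATION"
    articles: list = field(default_factory=list)

    def add_article(self, article):
        # Hypothetical helper: link both directions so the article
        # also knows every entity it belongs to.
        if article not in self.articles:
            self.articles.append(article)
            article.entities.append(self)


@dataclass(eq=False)
class Cluster:
    entities: list = field(default_factory=list)


@dataclass(eq=False)
class Ttopic:
    name: str
    sections: list = field(default_factory=list)  # parents: may be more than one
    clusters: list = field(default_factory=list)


@dataclass(eq=False)
class Section:
    name: str
    ttopics: list = field(default_factory=list)

    def add_ttopic(self, ttopic):
        # Hypothetical helper: link both directions.
        if ttopic not in self.ttopics:
            self.ttopics.append(ttopic)
            ttopic.sections.append(self)


# One ttopic shared by two sections, one article shared by two entities.
world, us = Section("World"), Section("U.S.")
topic = Ttopic("Example topic")
world.add_ttopic(topic)
us.add_ttopic(topic)

person = Entity("Jane Doe", "PERSON")
org = Entity("Example Corp", "ORGANIZATION")
topic.clusters.extend([Cluster([person]), Cluster([org])])

article = Article("Jane Doe joins Example Corp")
person.add_article(article)
org.add_article(article)
```

Keeping a list of parents on `Ttopic` and `Article` is one simple way to let those levels branch upwards while the rest of the structure stays a plain tree.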

Below is a quick description of the utility of each class:

Section
pulled directly from Google News, these are the available news sections: Top Stories, World, U.S., Business, Technology, Entertainment, Sports, Health, Science.

Trending topic (Ttopic)
also pulled directly from Google News but dynamically changing, these are the most popular news topics for each section at any given time (similar to Twitter's trending topics, which show the keywords that 'tweeters' are using most or that are growing fastest).

Article
news articles pulled directly from the Google News RSS feed for each ttopic. instead of being associated with a ttopic directly, articles are first associated with the entities (discussed below) that were extracted from them (this is why a single article can be repeated across several entities).

Entity
short for the concept of a 'named entity' from the information extraction world. these are words from article titles that are categorized as the names of persons, organizations, and locations with the help of Stanford NER from The Stanford NLP Group (there are other predefined categories that are not currently used in this project). once identified, entities are tallied by frequency per ttopic using PrefixSpan.

Cluster
a wrapper for entities. groups similar entities for a ttopic to reduce redundancy in the information displayed on a future client.
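As a rough illustration of the last two steps (tallying entities per ttopic, then wrapping similar ones in a cluster), here is a small sketch. It makes two loud assumptions: it uses `collections.Counter` as a simple stand-in for the PrefixSpan tally, and its notion of "similar" (one name's words are all contained in another's) is a hypothetical rule invented for this example, not the one the project uses. The function names and sample strings are also made up.

```python
from collections import Counter


def tally_entities(mentions):
    """Count how often each entity name appears for one ttopic.

    Stand-in for the PrefixSpan-based tally; plain frequency counting only.
    """
    return Counter(mentions)


def cluster_entities(counts):
    """Group names whose words are a subset of an existing cluster's words.

    Hypothetical similarity rule, chosen only to keep the example short.
    """
    clusters = []
    # Visit names with the most words first so shorter variants attach to them.
    ordered = sorted(counts.items(), key=lambda item: -len(item[0].split()))
    for name, count in ordered:
        tokens = set(name.lower().split())
        for cluster in clusters:
            if tokens <= cluster["tokens"]:
                cluster["names"].append(name)
                cluster["count"] += count
                break
        else:
            # No existing cluster covers this name, so start a new one.
            clusters.append({"tokens": tokens, "names": [name], "count": count})
    return clusters


# Entity names as they might be extracted from article titles for one ttopic.
mentions = ["Barack Obama", "Obama", "White House", "Obama", "Barack Obama"]
counts = tally_entities(mentions)
clusters = cluster_entities(counts)
```

Here "Obama" and "Barack Obama" end up in the same cluster with a combined count, so a client would only need to show the idea once.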
