ICWSM Session 4: Data Mining and Sentiment Analysis

A Categorical Model for Discovering Latent Structure in Social Annotations

This paper describes a model for discovering the structure of semantic topics over documents using tags.

They propose a community-based categorical annotation model.  Communities form around interests, expertise, language, etc.  Each community has a number of categories as its world view.  Therefore the community draws tags about a document from its list of categories.  Use Gibbs Sampling to recover communities and categories.  This gives you a distribution of communities and categories.
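To make the generative story concrete, here is a minimal sketch of how I understood it: a community picks one of its categories, and the category emits a tag.  The community names, categories, tags, and probabilities below are all made up for illustration and are not from the paper, and this only covers the forward (generating) direction, not the Gibbs sampling used to recover the structure.

```python
import random

# Hypothetical toy parameters, invented for illustration (not from the paper).
COMMUNITY_CATEGORIES = {
    "photographers": {"technique": 0.7, "place": 0.3},
    "travelers": {"place": 0.8, "technique": 0.2},
}
CATEGORY_TAGS = {
    "technique": {"macro": 0.5, "hdr": 0.5},
    "place": {"paris": 0.6, "beach": 0.4},
}


def draw(dist, rng):
    """Draw one key from a {key: probability} dict."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_tags(community, n_tags, rng):
    """Generate tags for one document: community -> category -> tag."""
    tags = []
    for _ in range(n_tags):
        category = draw(COMMUNITY_CATEGORIES[community], rng)
        tags.append(draw(CATEGORY_TAGS[category], rng))
    return tags


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = generate_tags("photographers", 5, rng)
```

Inference would run the other way: given only the observed tags, estimate which communities and categories most plausibly produced them.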

Experiments used a corpus from Flickr and delicious.com.

First looked at similarity between content-based topics and tag-based topics.  No real similarity.

Propose an example application: Information Access via Topic and Category.  Given a particular page, you can see pages that are similar in content, tags, both, or neither.

My question is about a person’s membership in a community.  When I look at an object, I’m influenced by the tags…it becomes part of my vocabulary.  So really I have a distribution of membership in the different categories.  I’m not sure if/how this work takes that into account.

Content Based Recommendation and Summarization in the Blogosphere

Given a set of blogs related to a topic, find a subset of blog feeds to read that are relevant to the topic.  Previous work focuses on link popularity (PageRank, HITS).  This one looks at content similarity.  It builds a blog post network graph, where directed, weighted edges indicate links from one post to another.  Blog importance is defined as the importance of the adjacent blogs.  Post similarity is defined by TF-IDF.
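For reference, a minimal sketch of TF-IDF post similarity.  The talk didn't specify the weighting variant or the similarity measure, so the plain tf * log(N/df) weighting and cosine similarity here are my assumptions, and the example posts are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up example posts, already tokenized.
posts = [
    "django deployment with pinax".split(),
    "pinax deployment notes".split(),
    "sentiment analysis of blogs".split(),
]
vecs = tfidf_vectors(posts)
sim_01 = cosine(vecs[0], vecs[1])  # the two posts share terms
sim_02 = cosine(vecs[0], vecs[2])  # no terms in common
```

Similarities like these could serve as the edge weights in the post graph, though I don't know exactly how the paper combines them with the link structure.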

This also defines a diversity ranking, and discounts nodes that are too similar to previously selected nodes.  It also adds a user-defined quality factor.
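A rough sketch of what a diversity-aware selection could look like: greedily take the highest-scoring node, then discount everything that resembles it.  The multiplicative (1 - similarity) discount is my own assumption, not the paper's formula, and the scores and similarities are toy values.

```python
def select_diverse(scores, similarity, k):
    """Greedily pick k items, discounting items similar to earlier picks.

    scores: {item: importance}
    similarity: function (a, b) -> value in [0, 1]
    The multiplicative discount is an assumed form, not the paper's.
    """
    remaining = dict(scores)
    selected = []
    while remaining and len(selected) < k:
        best = max(remaining, key=remaining.get)
        selected.append(best)
        del remaining[best]
        # Penalize whatever looks like the item we just picked.
        for item in remaining:
            remaining[item] *= 1.0 - similarity(best, item)
    return selected


# Toy data: "a" and "b" are near-duplicates, "c" is different.
scores = {"a": 1.0, "b": 0.9, "c": 0.5}
sims = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.1,
}
picked = select_diverse(scores, lambda x, y: sims[frozenset((x, y))], 2)
```

Here "b" starts with the second-highest score but loses out to "c" because it is nearly identical to "a", which was already selected.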

Experiments used BLOG2006 dataset.  Calculate node quality using a Linear Threshold diffusion model.  Compared this algorithm to a random selection of blogs, a simple heuristic for selecting blogs, and a greedy algorithm.
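For anyone unfamiliar with the Linear Threshold model, here is a minimal sketch: a node activates once the summed weight of its already-active in-neighbors reaches its threshold.  Thresholds are normally drawn at random; I fix them here to keep the example repeatable, and the graph is a toy one.  How the paper turns the resulting spread into a node quality score is not something I caught.

```python
def linear_threshold(in_edges, thresholds, seeds):
    """Return the set of nodes active once the cascade stops.

    in_edges: {node: {neighbor: weight}} giving incoming influence
    thresholds: {node: value in [0, 1]}
    seeds: iterable of initially active nodes
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, neighbors in in_edges.items():
            if node in active:
                continue
            influence = sum(w for nb, w in neighbors.items() if nb in active)
            if influence >= thresholds[node]:
                active.add(node)
                changed = True
    return active


# Toy graph with fixed thresholds (normally these are random).
in_edges = {
    "b": {"a": 0.6},
    "c": {"a": 0.3, "b": 0.3},
    "d": {"c": 0.2},
}
thresholds = {"b": 0.5, "c": 0.5, "d": 0.5}
reached = linear_threshold(in_edges, thresholds, {"a"})
```

Seeding "a" activates "b" directly, then "c" through the combined weight of "a" and "b", while "d" never gets enough influence.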

Leskovec has a good paper on selecting interesting blogs using a diffusion model.  It takes into account the fact that there are a lot of repeated stories/topics — I don’t need to read all of them.  I wonder if he could compare this model to that one.

Also, I’m not clear as to whether he’s selecting whole blogs or blog posts.  The algorithm is based on blog posts, and that may not be very accurate.  For example, the most popular post on this blog (by far) is about Pinax deployment, but I haven’t really written about that much in the last year.  Would this entire blog be flagged with this algorithm?

I think there are a lot of techniques for solving this problem.  I propose to evaluate them by accumulating the blogs of the social media researchers.

Targeting Sentiment Expressions through Supervised Ranking of Linguistic Configurations

Sentiment Expressions – single or multi-word phrases that express evaluation.  Assumes binary polarity (positive, negative).

Target – word/phrase which is the object of evaluation.  Sentiment expressions only link to physical targets.

So, given annotated mentions of sentiment analysis, find the target.  Did lots of manual annotations to come up with a gold standard against which they can test the targeting system (I feel for those summer interns).

Baseline approach: proximity.  Nearest mention selected as target.
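The proximity baseline is simple enough to sketch in a few lines.  I'm assuming distance is measured in token positions, which the talk didn't spell out; the sentence and its annotations are made up.

```python
def nearest_mention(expr_index, mention_indices):
    """Return the mention position closest to the sentiment expression."""
    if not mention_indices:
        return None
    return min(mention_indices, key=lambda i: abs(i - expr_index))


tokens = "the camera is great but the battery dies fast".split()
# Hypothetical annotations: token positions of candidate targets
# and of the sentiment expression.
mentions = [1, 6]  # "camera", "battery"
expression = 3     # "great"
target = tokens[nearest_mention(expression, mentions)]
```

This picks "camera" for "great", which happens to be right here, but it is easy to see how longer sentences would fool it.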

Baseline 2: Run a dependency parser.

They use supervised ranking.  Build independent classifiers for each sentiment expression/possible target pair, and a ranking algorithm to help select between them.  Classifiers trained using RankSVM.

Results: better than baselines.  80% precision/recall/F-score is the human max; this system is around 70%.  This approach definitely beats bag of words and proximity.