Fitzgerald Steele

Usability, User Experience, Social Media, Web Design and Development…

Social Media + Open Access: Transforming Politics and Academia in One Shot

This is a great use of statistics by a couple of PhD candidates at New York University, identifying and quantifying some strong irregularities in the election results released by the Iranian Government.

The Devil Is in the Digits: Evidence That Iran’s Election Was Rigged – washingtonpost.com

The numbers look suspicious. We find too many 7s and not enough 5s in the last digit. We expect each digit (0, 1, 2, and so on) to appear at the end of 10 percent of the vote counts. But in Iran’s provincial results, the digit 7 appears 17 percent of the time, and only 4 percent of the results end in the number 5. Two such departures from the average — a spike of 17 percent or more in one digit and a drop to 4 percent or less in another — are extremely unlikely. Fewer than four in a hundred non-fraudulent elections would produce such numbers.

Even more than that, this demonstrates the power of cheap, easy access to information over the internet.  Two students can pinpoint problems with election results half a world away, publish them, and have them copied, shared, bookmarked, and re-tweeted all over the world in a couple of hours, using publicly available data and free software.  We haven’t even begun to figure out all the ways in which the internet, social media, open source and open access are changing the way the world works.

This is another nail in the coffin for traditional academic publishing.  If and when Scacco and Beber write this up as a journal article, the results won’t be publicly available until it goes through the rigorous academic review cycle, which could be anywhere from a few months to several years.  But the events in Iran are happening now.  Protestors are dying now.  Publishing these results in a respected newspaper gets the results out much more quickly.

Furthermore, the authors made the data and statistical analysis code available for anyone in the world to download and run (remember, it’s all free software, and free data).  This means that instead of a couple of journal editors validating their work, they’ve effectively crowdsourced the review process.  Brilliant.  I don’t know how traditional academic journals can continue to pretend to be socially relevant in a world where anyone can make their work public, and that work can be independently verified in minutes instead of years.
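To get a feel for the kind of check involved, here is a minimal Python sketch of a last-digit test.  It is not the authors’ code: the function names, the simulation settings, and the choice to estimate the probability by simulation are my own assumptions for illustration, and the thresholds are simply the 17 percent and 4 percent figures from the quote.

```python
import random
from collections import Counter


def last_digit_shares(counts):
    """Return the share of vote counts ending in each digit 0-9."""
    tally = Counter(n % 10 for n in counts)
    total = len(counts)
    return [tally.get(d, 0) / total for d in range(10)]


def prob_spike_and_drop(n_counts, high=0.17, low=0.04, trials=20000, seed=0):
    """Monte Carlo estimate of how often uniformly random last digits show
    one digit at a share >= high AND another digit at a share <= low.

    n_counts (how many vote counts are in the data set) and trials are
    assumptions the caller has to supply.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        tally = Counter(rng.randrange(10) for _ in range(n_counts))
        shares = [tally.get(d, 0) / n_counts for d in range(10)]
        if max(shares) >= high and min(shares) <= low:
            hits += 1
    return hits / trials
```

Feeding the published vote counts to last_digit_shares would reproduce the digit percentages, and prob_spike_and_drop gives a rough idea of how rare such a pattern is when the last digits really are random.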

Update: The annotated version of the Scacco and Beber paper points to a couple of other articles that show statistical irregularities in the Iran vote counts: in the first digits and in the second digits.

Written by fitzgeraldsteele

June 22, 2009 at 8:22 am

ICWSM Paper Titles – Wordle

This tag cloud was generated from all the paper titles that were presented at the ICWSM ’09 conference (http://www.icwsm.org). I don’t think anyone is surprised that ‘social’ is the major term.

Written by fitzgeraldsteele

May 27, 2009 at 10:09 pm

ICWSM Liveblog – Wordle

This tag cloud was generated from my liveblog of the ICWSM conference (http://fitzgeraldsteele.wordpress.com/tag/icwsm). I think it is interesting that ‘people’ shows up as the biggest term here, whereas it hardly registers in the paper titles.

Tagcloud generated by www.wordle.net

Written by fitzgeraldsteele

May 27, 2009 at 10:08 pm

ICWSM Keynote: Jon Kleinberg – Meme-tracking, Diffusion, and the Flow of Online Information

Intersection of news media, technology, and the political process.  Modern SM technology is a disruptive technology, similar to radio/TV in the 20th century.  How does information transmitted broadly by the media interact with the personal influence arising from social networks?

SM erases difference between global and local influence, making more of a continuum.  Speed of media reporting increasing, contributing to a 24 hour news cycle.  “A Challenge to healthy discourse.”  Online media also adds complexity to how political info flows through social networks.

The dynamics of the global news cycle

Examined whether the ‘news cycle’ is just a metaphorical construct, or whether it is actually visible in data.  If it’s visible, can we measure it, describe it?  Used data from Spinn3r, looked at 1M news articles and blog posts per day, 20K sources.

What basic “units” make up the news cycle?  Need some aggregate of articles that varies on the order of days and can handle half a terabyte of data.  Look for “memes”: identify text fragments, phrases, quotes that travel through many articles.  They create a weighted, directed, acyclic graph of mutational variants, then delete edges of minimum total weight such that each component has a single “sink” node.  This problem is NP-hard, but they can apply heuristics based on selecting a single edge out of each quote.  Produces a neat stacked histogram graph that shows the relative frequencies of stories related to a particular quote over time.
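The “single edge out of each quote” idea can be sketched in a few lines of Python.  This is my own illustration, not the speaker’s code: the talk did not say which edge the heuristic keeps, so picking the heaviest outgoing edge is an assumption, as are the function name and the edge format.

```python
def cluster_by_sink(edges):
    """edges: dict mapping (src, dst) -> weight in a DAG of phrase variants.

    Keep only each node's heaviest outgoing edge (an assumed rule), then
    group every node under the sink it eventually reaches. The graph must
    be acyclic, otherwise the walk in sink_of would never terminate.
    """
    best = {}      # src -> (dst, weight) of the edge we keep
    nodes = set()
    for (src, dst), weight in edges.items():
        nodes.update((src, dst))
        if src not in best or weight > best[src][1]:
            best[src] = (dst, weight)

    def sink_of(node):
        # Follow the single kept edge until we hit a node with no out-edge.
        while node in best:
            node = best[node][0]
        return node

    clusters = {}
    for node in sorted(nodes):
        clusters.setdefault(sink_of(node), []).append(node)
    return clusters
```

Because every node keeps at most one outgoing edge, each node leads to exactly one sink, which is what gives each component its single “sink” node.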

Use some analogies to describe temporal variations: eg species competing for resources in an ecosystem, or biological systems that synchronize to favor a small number of individuals at any point in time.  A model to describe this might include: imitation term, recency term.

Found a 2.5 hour gap between peak intensity of the story in mainstream media, vs when it peaked in the blogs.

Can also use the data to find stories where blogs lead the media.

The spread of political messages through social networks

Might look at chain-letter petitions as ‘tracers’ through the global social network.  These are good because 1) they are viral (you only get them via email), and 2) they come with their own tracer (the signatures on them).  Can’t see the full tree, but copies get posted to mailing lists, which can be found by search engine.  So they can build a partial tree, compensating for the mutations in the signature tree.

It turns out genetic mutation analogies are good…all kinds of mutations happen (people erase names, put funny names in the middle, etc).

Built the tree from two chain letters, and it looked funny.  If we’re in a small world network (six degrees of separation), why is the tree very deep and narrow, like a depth-first search tree?  Possibly timing effects, assuming that nodes act on messages according to some delay.

So we can make some initial analogies like mutation, biology.  But these are really complex, global phenomena, that require richer models and knowledge of human behavior.  Ideas from computing and online media will be crucial to the next steps.

Written by fitzgeraldsteele

May 20, 2009 at 10:02 am

ICWSM Session 6: Modeling Social Dynamics

Stochastic Models of User-Contributory Web Sites

Interested in modeling how people view and rate existing content.  The talk is an extended example using Digg.

Votes on stories are a combination of visibility (do they see the story) and interest (do they like it, vote on it).  In this experiment, they don’t have info on visibility so they need to model it.

Their model captures key Digg qualitative features: slow initial growth, then fast growth as a story gets more views.

A model for promotion of an article is created.

Stochastic process approach used to connect user and system behaviors.  Applies to users with limited information and tasks.


Personal Information Management vs. Resource Sharing: Towards a Model of Information Behavior in Social Tagging Systems

Why do people tag?  Towards a model of tagging as info interaction behavior.
Is tagging a way to get around the vocabulary problem (different communities, different terms)?

Emerging tag models for Language (Linguistic Tag model), function, tag-relationship.  Found almost all tags relate to content, not time, task, emotion
TACS – web based tool for tag analysis
Used Amazon Mechanical Turk as a cheap way to get survey subjects, although there may be some problems (verification, biased population, platform)
Assume different motivations for tagging.  Organizing your own content (PIM) vs Media and information sharing.
Designed a questionnaire for Delicious, Connotea, Flickr, and YouTube users, using a 7-point Likert scale.

Qualitative analysis showed strong differences in motivation for using different sites.

Ease of tagging not significantly different.  Tagging is useful (Connotea users really think so).

Compares to Shneiderman 2002 Two dimensions of social interaction (activity vs. social sphere)

In terms of IR, people thought tags on Flickr/YouTube were more helpful than on Delicious/Connotea.  I’m surprised by that…I use tags on Delicious to locate information all the time.  For me, it’s one of the key features.  When I asked the speaker, he said his qualitative/quantitative results had no indication of that type of behavior.  I think that’s really interesting.  Time for a paper?

Activity Types (Cool & Belkin 2002).  May be applicable, but lacks a social dimension.

Motivation, Structure, and Tenure Factors that Impact Online Photo Sharing

Why do people in online communities share?  Photos, info, meta-information, code.  Want to quantify drivers for sharing and actual behavior.  Can look at the area in terms of WHY people share, WHAT they share, and WHERE.

Note a difference between creating and sharing.  They are separate, but many studies assume creation is coupled with sharing. Looked at Flickr data; combined survey data with system reported data.

Looks at 3 factors: Motivation (Intrinsic vs Extrinsic, Self vs. others)

Structure: Number of contacts

Tenure: Years since started sharing

Looked at artifact sharing per year tenure.

I wonder why they went with shares per year, not per month.  Seems like you could really see different outcomes for people that post habitually, vs people that share photos from a one-time trip.

Commitment, Number of contacts positively correlated with sharing.  Personal Enjoyment is not correlated (maybe because people motivated by creating more than sharing).  Self-development is negatively correlated with sharing (maybe because they are more interested in quality than quantity).  Time since first upload strongly negatively correlated with sharing (the longer you’re with a community, the less likely you are to share).  Maybe because of loss of interest.

Modeling Blog Dynamics

The blogosphere is a system of interactions of posts, topics, links, etc.  The purpose here is to create a generative model of the blogosphere that matches properties of the real blogosphere for prediction and motivation.

Actually 2 networks combined into one: Blog network and post network.

Goal: Model micro-level interactions to create the macro-level patterns (structure, and dynamic over time) of the blogosphere.

Structure/Topological Patterns: Power Laws (interposting time)

Temporal/Dynamic Patterns: Burstiness and Self-similarity

Proposed Model: ZC

In every timestep, for every blog, assign a state as part of an FSM, depending on how likely they are to blog.  If they blog, randomly decide if they will create a link to a neighbor or a ‘random blog’.

This creates a post distribution, burstiness, post popularity similar to real blogosphere.

Written by fitzgeraldsteele

May 19, 2009 at 4:04 pm

ICWSM Session 5: Panel Discussion – System Design and Community Culture

System Design and Community Culture

The role of rules and algorithms in shaping human behavior

Panelists:

  • Lukas Biewald, Dolores Labs
  • Rashmi Sinha, Slideshare
  • Cameron Marlow, Facebook

Dolores Labs – Making Crowds Efficient and Reliable.  They pay people to perform tasks, aka Amazon Mechanical Turk.

Slideshare – Focus on social design.  Presentations are fundamentally social – you don’t make them for yourself.  The social networking tools (commenting, favoriting, tagging) have led to the creation of a community.

Facebook – Marlow runs the Data Science team, which uses machine learning and research to understand how users use the site, and that leads to design changes.

Examples of Unexpected Community Behavior?

RS: What gets spam, what does not.  Particularly in their comment system.  They went through lots of iterations

LB: Prompting a task affected the outcome.  So now they work with people to define

What sort of design decisions are based on difficulty?

LB: Try to break a task into the smallest possible unit.

RS: Presentations are less frequent than say photos, so there are different rules.  Also differentiate between user types: content creator, readers, aggregators.

CM: Facebook isn’t really designed around a task.  They do lots of things to enable use at different levels.

Range of tasks across the three systems.  How do you learn how social interactions change tasks?

RS: Observed real life events (people gather around a presentation). Create a unit, and a construct around that.

CM: FB tries to lower the barrier to trying new tasks.  For example, someone can upload a photo, others can tag photos, add metadata, etc.

Design by Intuition vs. Design by Data.  What is your approach/process in developing new features?

RS: Start with intuition, primary hypothesis.  Look at what data is out in the world.  Once it’s up, there’s lots of data to see what people like, what people talk about.  Also do A/B testing.

LB: Can nicely segment users along whatever dimension you want, so you have lots of options.

CM: People react to change.  Some like it, some hate it.  What fraction of the population respond to the change.

We know you can prompt people to get certain types of behavior.  How do you compensate for that?

RS: Not so worried about that — doesn’t have to be scientific.  Of course, you can also do experiments to deal with it.

CM: There are many sources of bias in these large ecosystems.  Important that decision makers know about them.

Community, communicate, share.  What makes for successful conversation?

CM: Allow them to happen at a different scale, use aggregated tools to understand entire conversation.  For example, they have a tool that can find a term/keyword across all of Facebook, as a percentage of all text.  Helps them make sense (in some small way) of everything.

RS: Twitter hashtags are a really good, scalable way to communicate a topic.  Well, maybe partially scalable.  When a hashtag makes twitter trending topics, bots take over.  But things are good up until then.

How do people discover your content, features?

RS: Email, social network links, but mainly Google search

CM: The Wall.  Now have two feeds: 1 real time, 1 algorithm driven.

Twitter innovations: #hashtags, @replies, ReTweets – users came up with those.  How do you design so that users can extend the design on their own?

RS: Initial version of Slideshare was barebones.  Keep the initial design to the core, get feedback, refine.  Build new features based on what works.  Also, develop an API so people can extend your site.

CM: Design a platform so that people can build their own specific tools.

How do you enable the conversation/feedback between designers and community?  How do you differentiate edge case complaints vs real problems?

LB: Designers do customer support

RS: Ditto.  Also, use numbers, percentages of people that complain.

CM: Collect as many signals as possible.  If something shows up across many areas, it may be a real problem.

Written by fitzgeraldsteele

May 19, 2009 at 2:48 pm

ICWSM Session 4: Data Mining and Sentiment Analysis

A Categorical Model for Discovering Latent Structure in Social Annotations

This paper describes a model of the structure of semantic topics over documents, using tags.

They propose a community-based categorical annotation model.  Communities form around interests, expertise, language, etc.  Each community has a number of categories as its world view.  Therefore the community draws tags about a document from its list of categories.  Use Gibbs Sampling to recover communities and categories.  This gives you a distribution of communities and categories.

Used a corpus from Flickr and delicious.com to do experiments.

First looked at similarity between content-based topics and tag-based topics.  No real similarity.

Propose an example application: Information Access via Topic and Category.  Given a particular page, you can see pages that are similar in content, tags, both, or neither.

My question is about a person’s membership in a community.  When I look at an object, I’m influenced by the tags…it becomes part of my vocabulary.  So really I have a distribution of membership in the different categories.  I’m not sure if/how this work takes that into account.

Content Based Recommendation and Summarization in the Blogosphere

Given a set of blogs related to a topic, find a subset of blog feeds to read that have interest in the topic.  Previous work is on link popularity (PageRank, HITS).  This one looks at content similarity.  It builds a blog post network graph, where directed and weighted edges indicate links from one post to another.  Blog importance defined as the importance of the adjacent blogs.  Post similarity defined by TF-IDF.

This also defines a diversity ranking, and discounts nodes that are too similar to previously selected nodes.  It also adds a user-defined quality factor.
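Here is how I picture the similarity and diversity pieces fitting together, as a small Python sketch.  It is not the paper’s algorithm: the greedy “subtract the overlap with what is already picked” rule is a stand-in I chose, the penalty parameter is hypothetical, and the quality scores are assumed to come from somewhere else.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def pick_diverse(scores, vectors, k, penalty=1.0):
    """Greedily pick k items, discounting each candidate's score by its
    highest similarity to anything already picked (an assumed rule)."""
    chosen = []
    remaining = set(range(len(scores)))
    while remaining and len(chosen) < k:
        def adjusted(i):
            overlap = max((cosine(vectors[i], vectors[j]) for j in chosen), default=0.0)
            return scores[i] - penalty * overlap
        best = max(sorted(remaining), key=adjusted)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

With two near-identical posts and one unrelated post, the second pick skips the duplicate even though its raw score is higher, which is the behavior the diversity ranking is after.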

Experiments used BLOG2006 dataset.  Calculate node quality using a Linear Threshold diffusion model.  Compared this algorithm to a random selection of blogs, a simple heuristic for selecting blogs, and a greedy algorithm.

Leskovec has a good paper on selecting interesting blogs using a diffusion model.  It takes into account the fact that there are a lot of repeated stories/topics — I don’t need to read all of them.  I wonder if he could compare this model to that one.

Also, I’m not clear as to whether he’s selecting whole blogs or blog posts.  The algorithm is based on blog posts, and that may not be very accurate.  For example, the most popular post on this blog (by far) is about Pinax deployment, but I haven’t really written about that much in the last year.  Would this entire blog be flagged with this algorithm?

I think there are a lot of techniques for solving this problem.  I propose to evaluate them by accumulating the blogs of the social media researchers.

Targeting Sentiment Expressions through Supervised Ranking of Linguistic Configurations

Sentiment Expressions – single or multi-word phrases that express evaluation.  Assumes binary polarity (positive, negative).

Target – word/phrase which is the object of evaluation.  Sentiment expressions only link to physical targets.

So, given annotated mentions of sentiment analysis, find the target.  Did lots of manual annotations to come up with a gold standard against which they can test the targeting system (I feel for those summer interns).

Baseline approach: proximity.  Nearest mention selected as target.

Baseline 2: Run a dependency parser.

We use a supervised ranking.  Build independent classifiers for each sentiment expression/possible target pair, and a ranking algorithm to help select between them.  Classifiers trained using RankSVM.

Results: better than baselines.  80% precision/recall/f-score are the human max: this system is around 70%.  This approach definitely beats bag of words, proximity.

Written by fitzgeraldsteele

May 19, 2009 at 11:48 am

ICWSM Keynote: Duncan Watts – Using the Web to do Social Science

Duncan Watts knows about Social Networks.  Looks at the web as a tool for doing social science research.  Observing individual behavior, and interactions recorded in real time, for large populations.  Can the web lead to a social science revolution?

This talk will describe four studies, each motivated by a substantive science question, and illustrate the use of new technology that would not have been practical a few years ago.

Dynamic Network Analysis

We typically look at static networks.  If we look at how networks change over time, we can ask different questions

  • What factors affect tie formation/termination?
  • How do network properties change?
  • How constrained are individual choices?

Used an Email Data network in a university community: 14.5M emails over 9 months, 43K members

Found that Structure Drives Evolution: network proximity overwhelmingly determines new ties.  This type of data can be used to build empirical models of network formation.  Can also look at network stability of various properties over time.  However, found that individual rankings change.

Web-Based “Macro-Sociological” Experiment on Social Influence

Why are cultural “hits” (eg, Harry Potter, Titanic, Michael Jackson) so much more successful than average, yet so difficult to predict?  It may have to do with “Social Influence.”  If everyone is influencing each other, what effects does it have on the market overall?

You couldn’t really do this type of experiment (comparing individual and market-wide behavior) in a lab before the web (physical constraints, can’t go back in time).

Created a web-based “Cultural Market” Music Lab.  Subjects were shown a grid of 48 songs by unknown bands.  You can listen to a song, and decide to download it.  Subjects randomly assigned to 2 conditions: see download counts (social influence), or no download counts.  Also broke the study into 8 different ‘worlds’.

Found social influence at individual level: people are more likely to click on songs in the top 10.

Calculated properties at Collective Level.  Inequality of Success (Gini Coefficient), Unpredictability of Success (average difference in market share of songs across R realizations of the world).  Found that when people know what other people think, unpredictability goes up, and inequality of success is higher than in the control.  Also found that ‘best’ songs never do poorly, ‘worst’ songs never do great.
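Both collective-level measures are easy to compute once you have market shares.  The Python below is my own sketch, not the study’s code: the Gini formula is the standard mean-absolute-difference form, and the unpredictability function is my reading of “average difference in market share across R realizations” (averaged over every pair of worlds, then over songs), so treat that definition as an assumption.

```python
from itertools import combinations


def gini(shares):
    """Gini coefficient of a list of non-negative market shares
    (assumes at least one share is positive)."""
    n = len(shares)
    total = sum(shares)
    diff_sum = sum(abs(a - b) for a in shares for b in shares)
    return diff_sum / (2 * n * total)


def unpredictability(worlds):
    """worlds: list of R lists, each giving every song's market share in
    one world (R >= 2, same song order in every list).

    For each song, average the absolute share difference over all pairs
    of worlds; then average that over songs.
    """
    pairs = list(combinations(range(len(worlds)), 2))
    n_songs = len(worlds[0])
    per_song = [
        sum(abs(worlds[j][s] - worlds[k][s]) for j, k in pairs) / len(pairs)
        for s in range(n_songs)
    ]
    return sum(per_song) / n_songs
```

A market where every song has the same share has a Gini of 0, and a set of worlds that all end up with identical shares has an unpredictability of 0.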

Broader Impacts: Relevant whenever people base decisions on observations of others.  Markets do not simply reveal stable underlying preferences.  Institutions based on “Representative Individual” models may need to be revised.

Network Survey on Facebook

Some evidence that Americans increasingly group themselves with like minded individuals, some contrary evidence.  One hypothesis is that people are not as similar to their friends as they think.  This type of study is really hard to put together, execute.  Facebook made it easier.

Created the “Friend Sense” application, which asks you what your friends think about various questions.  Got 2500 respondents, 12,160 complete dyads.  This is a known biased population, but a traditional study would have cost 200-300K and taken 2 years.  On Facebook, it took 2-3K and a couple of months.

Results: Friends are more similar than strangers, but not as similar as they think.  It turns out people are unaware of much of what their friends think.

How Do Financial Incentives Affect Performance

Assumption that performance-based pay should result in better work than fixed pay.  This study asks whether an employer can elicit better performance from a given worker pool by paying them more.

This study looked at web-based peer production (Wikipedia, Y! Answers, Digg).  Used Amazon Mechanical Turk for crowdsourcing labor.  Subjects accept a job, receive an up-front fee.  They are sent to another site, where they are assigned a task, and will receive a bonus for doing well.  Randomly assigned to 3 pay levels (low, medium, high).

Results: Subjects do more work for more pay.  Also, do less work for harder tasks.  However, found that increasing pay did NOT improve accuracy.

Also, found that people always thought they were being underpaid (in post questions, people said they should be paid more).

Tentative suggestion: Payment levels should be dictated by recruitment and retention, not direct impact on quality. (lots of caveats on the results of this study).

Concludes that social network platforms offer social scientists new tools to study social interactions, collective dynamics.

Lots of exciting progress in “network science”

  • physics, computer science, sociology, economics
  • massive scale
  • network experiments
  • Large, observational studies

Fundamental advances will require new approaches.  Need to study large populations of individuals, plus interactions and behavior over extended time.

Social Science 2.0: Should address macro-level phenomena from a bottom up understanding of micro-level social interactions.

  • scale of data, experimentation, platforms still infeasible for lone researchers
  • shared data storage, experimental platforms, subject pools needed
  • human subject research on the web = privacy and consent issues

I really liked this discussion because it relates some of the technical discussions at this conference (network analysis, data mining, etc) with larger social issues.

Written by fitzgeraldsteele

May 19, 2009 at 9:59 am

ICWSM Session 3: Ranking

CourseRank: A Closed-Community Social System through the Magnifying Glass

This paper discusses a social-media course selection site for Stanford University.  It combines official university course information, grade distributions, and course reviews with user generated comments, reviews, etc.  It has course planning and recommendations, plus course clouds to find courses related to certain topics.

85% of Stanford undergrads use the site, way more than open community sites.

Using Transactional Information to Predict Link Strength in Online Social Networks

Analyzed the Purdue Facebook network.  Generated different friend graphs for: Friends, Wall Posting, Pictures.  The Wall/Picture graphs have a much lower InDegree/OutDegree than the Friends network.  This may indicate that wall postings are a better indication of who your ‘real’ friends are.  I thought it was interesting that people had, on average, 21 people writing on their Wall, but only wrote on 7 other people’s Walls.

Used the ‘Top Friends’ application as ‘truth’ of who your top friends are.  This paper compares 3 types of supervised learning algorithms, and four types of features to predict link strength through four separate experiments.

  • Experiment 1: Found 12 of 15 top features are network-transactional type features, with wall information working best.
  • Experiment 2: Network transactional features had highest accuracy
  • Experiment 3: Compared link type.  Wall features had best accuracy.  Picture information quite bad
  • Experiment 4: Bagged decision trees had the best accuracy.  97% of performance comes from network transactional features

Network transactional features take into account transactions from person A to person B, moderated by the number of transactions A makes to everyone else.
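One simple way to read that description is as a ratio: A’s transactions to B divided by all of A’s transactions.  The Python sketch below does just that; the paper may well define its features differently, so the exact formula, the function name, and the event format are assumptions on my part.

```python
from collections import Counter


def transactional_strength(events):
    """events: iterable of (sender, receiver) pairs, e.g. one per wall post.

    Returns {(a, b): share of a's outgoing transactions that went to b}.
    """
    pair_counts = Counter(events)
    out_totals = Counter()
    for (sender, _receiver), count in pair_counts.items():
        out_totals[sender] += count
    return {
        (sender, receiver): count / out_totals[sender]
        for (sender, receiver), count in pair_counts.items()
    }
```

The point of the normalization is that three wall posts from someone who rarely posts say more about the tie than three posts from someone who writes on everybody’s Wall.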

RevRank: A Fully Unsupervised Algorithm for Selecting the Most Helpful Book Reviews

We use product reviews to make purchasing decisions.  Many reviews (on Amazon) are repetitive, of limited contribution, poorly written, or unnoticed (and, as we learned this morning, confusing or plagiarized).  Amazon has User Voting, which has some problems (imbalance vote bias, early bird bias, Winner Circle bias).

This work locates helpful reviews based on dominant concepts.  Term Dominance is similar to TF-IDF.

Examined 12,000 reviews of 5 books.  Compared algorithm to a human user vote and random sample.

RevRank did a good job of finding ‘helpful’ reviews, better than the other two conditions.

Written by fitzgeraldsteele

May 18, 2009 at 5:51 pm

Posted in conference, research, social media

ICWSM Session 2: Psychology and Users

Does showing off help to make friends? Experimenting a sociological game on self-exhibition and social networks

Used the site http://socialgeek.com to gather data on how people would choose to portray themselves on a social network site.  Showed them various types of pictures (provocative, standard, showing off, body immodesty) and asked whether they would use the photo as a profile picture.

Found a correlation between number of friends and self-exhibition.  They suggest that people may use ‘show off’ type pictures to gain online friends.

Also found that people like to be friends with people like them (similar age, socio-economic status, etc).  Except people in the study preferred to be friends with women.

I think this is sort of related to our study of online profile photos, but I’m not exactly sure how.  This paper used a different personality scale, so I’m not clear how we can compare the two.

What Are They Blogging About?  Personality, Topic and Motivation in Blogs

One way to categorize motivation to blog:

  • Internal (documenting life, catharsis)
  • External (Interests, Opinions)

Using the Five Factor Personality Model to make some hypotheses about personality.  Did some text analysis on a blog corpus from BlogMetrics using LIWC text analysis tool, as implemented in TAWC.  For bloggers high in these factors:

Neuroticism: self-therapy/catharsis – focusing on self and venting purely negative feelings.

Extraversion: Talk a lot about themselves and other people.  Use lots of 1st person, 2nd person, 3rd person pronouns.  Used lots of positive emotion words.

Openness: Review/evaluation of leisure interests from personal perspective

Conscientiousness: Faithfully document life going on around them, references to others.  Lots of talk about their job, people around them.

Agreeableness: positive self-talk focus, negative emotions and leisure activities avoided.

As part of the tutorials yesterday, I took a simple Five Factor Personality test, where I scored high on Agreeableness and very low on Neuroticism.  I’d like to look at my blogs and see whether these findings describe my own behavior.


A Social Identity Approach to identify Familiar Strangers in a Social Network

Familiar Strangers: People you observe repeatedly, but do not know.  In real life, people you see daily on the train.  Online, people with similar blogging behavior and interests, but not on the same social network.  It would be nice to find these people, to understand more niche interests, do predictive modeling and trend analysis, etc.

This is interesting because it focuses on trying to find and connect people with narrow, niche interests (the long tail of the blogosphere).

They use a Social Identity approach.  People cluster contacts into meaningful groups.  So we really propagate the search through relevant clusters of contacts.  We limit the search space.

Used blog tags and content to generate a vector that describes a blog, and then calculated similarity using cosine similarity.  Clustered with k-means.  Compared their results against 1) exhaustive approach, 2) random approach.

Results indicate the Social Identity approach has accuracy between 80-90%, depending on the dataset: much better accuracy than random, and much faster than an exhaustive search.

This research assumes an egocentric search.  You look first at people that are connected to you in the network.  But that doesn’t seem realistic.  I can find familiar strangers on sites like delicious.com or twitter via tag cloud, rather than searching first through my contacts.  I asked the speaker this question.  He suggested that his approach would be helpful in locating people near the cluster of people that use a particular tag, but not the precise tag.

You Are Where You Edit: Locating Wikipedia Contributors through Edit Histories

Exploring the increasingly prevalent role of geography on the web.  Allows geographically informed content retrieval, filtering. Potential invasion of privacy.  Looked at Wikipedia Geopages – pages that correspond to a physical location in the real world, with lat/long coordinates.

This paper wants to know if we can characterize the location of the people who contribute to geopages.  Used DBPedia to bootstrap finding the geopages.  There’s a tradeoff in that Wikipedia only records a single point, instead of an extent/area.

330K geopages.  They want to find contributors with a large number of edits to geopages constrained to a small area (~ 70mi x 70mi).
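The “small area” filter is simple to sketch in Python.  This is only my guess at the shape of the check, not the paper’s method: the 69 miles per degree figure is a rough approximation, min_edits is a made-up threshold (the talk only said “a large number of edits”), and the bounding box ignores wrap-around at the 180° meridian.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # rough average; an approximation


def edit_extent_miles(points):
    """points: list of (lat, lon) for the geopages one user edited.

    Returns the (north-south, east-west) size of their bounding box in miles.
    """
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    mid_lat = (min(lats) + max(lats)) / 2
    ns = (max(lats) - min(lats)) * MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    ew = (max(lons) - min(lons)) * MILES_PER_DEGREE_LAT * math.cos(math.radians(mid_lat))
    return ns, ew


def is_local_editor(points, min_edits=5, box_miles=70.0):
    """True if the user has enough geopage edits and they all fit in a
    box_miles x box_miles box. min_edits is a hypothetical threshold."""
    if len(points) < min_edits:
        return False
    ns, ew = edit_extent_miles(points)
    return ns <= box_miles and ew <= box_miles
```

A single edit to a far-away page is enough to fail the check, so a real version would probably need to tolerate a few outliers.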

Over half of contributors make most of their edits on 1-2 pages.  Looked at 100 random user pages to determine motivation: most people live in that place, or were born there.

Written by fitzgeraldsteele

May 18, 2009 at 2:54 pm