Archive for category research
I knew that drawing visualizations of your Facebook friends network was fun, but I didn’t know it could get you published!
This Facebook network visualization was published in the Journal of Social Structure Visualization Symposium. As the author states, this visualization has some nice features:
- The angle of each wing is proportionate to its share of the network. Thus 25 percent of nodes go from 0 to 90º.
- Partitions are distinguished by their position rather than a node’s color or shape.
- The tail indicates the periphery of each partition. A wing with many tail nodes indicates many people who are only tied to other group members.
- Edges crossing the center show between-partition connections. Since nodes are sorted by degree it is easy to see if edges originate from the most highly connected nodes or the entire partition.
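The wing-angle rule is easy to make concrete. A minimal sketch (partition names and sizes are invented for illustration):

```python
def wing_angles(partition_sizes):
    """Map each partition's share of nodes to a slice of the 360-degree circle."""
    total = sum(partition_sizes.values())
    angles = {}
    start = 0.0
    for name, size in partition_sizes.items():
        span = 360.0 * size / total
        angles[name] = (start, start + span)  # (start angle, end angle) in degrees
        start += span
    return angles

# A partition with 25 percent of the nodes spans exactly 90 degrees.
angles = wing_angles({"high school": 25, "university": 25, "work": 50})
print(angles["high school"])  # (0.0, 90.0)
```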
In this network it is easy to see a strong series of linkages between high school and university as well as high school and family. There are many ties between the current co-workers and professional colleagues, and neither connects substantially to high school. While just as populous, the professional partition is far less dense than the high school partition.
This visualization is oriented towards well-connected modular networks (meaning they are easily partitioned into distinct communities). Facebook egocentered networks often have these properties, whereby each partition represents a life course stage or social context and close friends link between partitions.
This graph was generated using NodeXL, an Excel-based network tool.
The Real Life Social Network is a great presentation by Paul Adams, UX Researcher at Google, on the differences between online and offline social networks, and how those differences cause user confusion and even pain. One of the main reasons for this disconnect, he claims, is that online social networking sites tend to put all your connections in one big bucket (Friends on Facebook, Connections on LinkedIn, etc.), whereas in real life, across cultures, people tend to have 4-6 relatively independent groups of connections, with 2-10 people in each group.
This sounds about right to me, but I wanted to see if I could see this in my own social network. I used the excellent gephi network visualization and analysis tool, along with these instructions to generate a network graph of my facebook friends.
It looks like I’ve got about 7 discrete social networks (click the image to see more details, labels, etc):
- College friends (mostly from the Hawkeye Marching Band)
- Graduate School colleagues (fellow grad students and professors)
- Current Work Colleagues
- Church friends
- High School classmates
- Former Job Colleagues (mostly from when I worked and lived in England)
I learned a few things in doing this exercise:
- Facebook turns out to be a pretty decent proxy for my offline social network. If someone were to ask me, as Google did in their social network user research, to identify my people, place them in groups, and name the groups, this is pretty much the list I would’ve come up with.
- I’ve got more than 10 people in most of my groups. However, this graph doesn’t really take into account the strength of the connections. If I were to apply a filter to this graph that only showed people who posted on my wall, or whose wall I posted on recently, I bet the number would be much closer to 10 per group. And some of those groups would disappear. Which leads to…
- These groups and connections are dynamic. My groups, and my attention to them, wax and wane over time.
- I didn’t need Betweenness Centrality to know that my wife is the center of my world. =)
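For readers curious what betweenness centrality actually measures, here is a small self-contained sketch: the fraction of shortest paths between other node pairs that pass through each node, computed by brute force (fine for toy ego networks; the node names are hypothetical):

```python
from itertools import combinations
from collections import deque

def all_shortest_paths(graph, s, t):
    """Enumerate all shortest paths from s to t with BFS (fine for tiny graphs)."""
    paths, best = [], None
    queue = deque([[s]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            continue
        node = path[-1]
        if node == t:
            best = len(path)
            paths.append(path)
            continue
        for nb in graph[node]:
            if nb not in path:
                queue.append(path + [nb])
    return paths

def betweenness(graph):
    """Fraction of shortest paths between other node pairs passing through each node."""
    score = {v: 0.0 for v in graph}
    for s, t in combinations(graph, 2):
        paths = all_shortest_paths(graph, s, t)
        if not paths:
            continue
        for v in graph:
            if v in (s, t):
                continue
            through = sum(1 for p in paths if v in p)
            score[v] += through / len(paths)
    return score

# Hypothetical ego network: "spouse" bridges two otherwise separate friend groups.
graph = {
    "me": {"spouse"},
    "spouse": {"me", "college", "church"},
    "college": {"spouse", "bandmate"},
    "church": {"spouse", "deacon"},
    "bandmate": {"college"},
    "deacon": {"church"},
}
scores = betweenness(graph)
print(max(scores, key=scores.get))  # the bridging "spouse" node scores highest
```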
Adams goes on to describe that the web is fundamentally changing because it is becoming a web not just of documents and data, but also of people and relationships. He argues that designers must learn to build systems with these new constructs. The desktop model of one person, dealing with one system, in a cozy office environment is broken. Relationships, influence, identity, and privacy must be built into next-generation systems from the ground up.
This is a great use of statistics by a couple of PhD candidates at New York University, identifying and quantifying some strong irregularities in the election results released by the Iranian Government.
The numbers look suspicious. We find too many 7s and not enough 5s in the last digit. We expect each digit (0, 1, 2, and so on) to appear at the end of 10 percent of the vote counts. But in Iran’s provincial results, the digit 7 appears 17 percent of the time, and only 4 percent of the results end in the number 5. Two such departures from the average — a spike of 17 percent or more in one digit and a drop to 4 percent or less in another — are extremely unlikely. Fewer than four in a hundred non-fraudulent elections would produce such numbers.
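The last-digit test the authors describe is simple to reproduce. A minimal sketch; the vote totals below are made up for illustration, not the actual Iranian figures:

```python
from collections import Counter

def last_digit_freqs(counts):
    """Share of vote counts ending in each digit.

    In a clean election, last digits should be roughly uniform, with each
    digit appearing at the end of about 10 percent of the vote counts.
    """
    digits = [int(str(c)[-1]) for c in counts]
    n = len(digits)
    tally = Counter(digits)
    return {d: tally.get(d, 0) / n for d in range(10)}

# Hypothetical vote totals, skewed toward 7 to mimic the reported anomaly.
votes = [1247, 5317, 9887, 2457, 3127, 60, 777, 1984, 457, 307]
freqs = last_digit_freqs(votes)
print(freqs[7], freqs[5])  # a large spike in 7s, no 5s at all
```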
Even more than that, this demonstrates the power of cheap, easy access to information over the internet. Two students can pinpoint problems with election results half a world away, publish them, and have them copied, shared, bookmarked, re-tweeted all over the world in a couple of hours. Using publicly available data, and free software. We haven’t even begun to figure out all the ways in which the internet, social media, open source and open access is changing the way the world works.
This is another nail in the coffin for traditional academic publishing. If and when Scacco and Beber write this up as a journal article, the results won’t be publicly available until it goes through the rigorous academic review cycle, which could be anywhere from a few months to several years. But the events in Iran are happening now. Protestors are dying now. Publishing these results in a respected newspaper gets the results out much more quickly.
Furthermore, the authors made the data and statistical analysis code available for anyone in the world to download and run (remember, it’s all free software, and free data). This means that instead of a couple journal editors validating their work, they’ve effectively crowdsourced the review process. Brilliant. I don’t know how traditional academic journals can continue to pretend to be socially relevant in a world where anyone can make their work public and have that work independently verified in minutes instead of years.
This tag cloud was generated from all the paper titles that were presented at the ICWSM ’09 conference (http://www.icwsm.org). I don’t think anyone is surprised that ‘social’ is the major term.
This tag cloud was generated from my liveblog of the ICWSM conference (http://fitzgeraldsteele.wordpress.com/tag/icwsm). I think it is interesting that ‘people’ shows up as the biggest term here, whereas it hardly registers in the paper titles.
Tagcloud generated by http://www.wordle.net
Intersection of news media, technology, and the political process. Modern SM technology is a disruptive technology, similar to radio/TV in the 20th century. How does information transmitted broadly by the media interact with the personal influence arising from social networks?
SM erases the difference between global and local influence, making it more of a continuum. The speed of media reporting is increasing, contributing to a 24-hour news cycle. “A challenge to healthy discourse.” Online media also adds complexity to how political info flows through social networks.
The dynamics of the global news cycle
Examined whether the ‘news cycle’ is merely a metaphorical construct, or is actually visible in data. If it’s visible, can we measure it and describe it? Used data from Spinn3r: 1M news articles and blog posts per day, from 20K sources.
What basic “units” make up the news cycle? Need some aggregate of articles that varies on the order of days and can handle a half-terabyte of data. Look for “memes”: identify text fragments, phrases, and quotes that travel through many articles. They create a weighted, directed, acyclic graph of mutational variants, then delete the minimum total edge weight such that each component has a single “sink” node. This problem is NP-hard, but heuristics based on selecting a single edge out of each quote can be applied. This produces a neat stacked histogram graph that shows the relative frequencies of stories related to a particular quote over time.
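The single-edge heuristic can be sketched roughly like this. This is my own simplified reading of the approach: attach each quote variant to its best-overlapping, more frequent variant, then follow edges to a root; the word-overlap measure and threshold are assumptions, not the paper's actual scoring:

```python
def word_overlap(a, b):
    """Fraction of a's words that also appear in b."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa), 1)

def cluster_quotes(quote_counts, threshold=0.6):
    """Keep one edge per quote, to its best-overlapping more frequent variant
    (the single-edge heuristic), then chase edges to find each quote's sink."""
    parent = {}
    for q in quote_counts:
        candidates = [
            p for p in quote_counts
            if p != q and quote_counts[p] > quote_counts[q]
            and word_overlap(q, p) >= threshold
        ]
        if candidates:
            parent[q] = max(candidates, key=lambda p: word_overlap(q, p))

    def root(q):
        while q in parent:
            q = parent[q]
        return q

    return {q: root(q) for q in quote_counts}

# Hypothetical mutational variants of one quote, with article counts.
counts = {
    "lipstick on a pig": 50,
    "put lipstick on a pig": 20,
    "you can put lipstick on a pig": 5,
}
clusters = cluster_quotes(counts)
print(clusters["you can put lipstick on a pig"])  # lipstick on a pig
```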
Use some analogies to describe temporal variations: e.g., species competing for resources in an ecosystem, or biological systems that synchronize to favor a small number of individuals at any point in time. A model to describe this might include an imitation term and a recency term.
Found a 2.5 hour gap between peak intensity of the story in mainstream media, vs when it peaked in the blogs.
Can also use the data to find stories where blogs lead the media.
The spread of political messages through social networks
Might look at chain-letter petitions as ‘tracers’ through the global social network. These are good because 1) they are viral (you only get one via email), and 2) each comes with its own tracer (the signatures on it). Can’t see the full tree, but copies get posted to mailing lists, which can be found by search engines. So they can build a partial tree, compensating for the mutations in the signature list.
It turns out genetic mutation analogies are good…all kinds of mutations happen (people erase names, put funny names in the middle, etc).
Built the tree from two chain letters, and it looked funny: if we’re in a small-world network (six degrees of separation), why is the tree very deep and narrow, like a depth-first search tree? Possible timing effects, assuming that nodes act on messages according to some delay.
So we can make some initial analogies like mutation, biology. But these are really complex, global phenomena, that require richer models and knowledge of human behavior. Ideas from computing and online media will be crucial to the next steps.
Stochastic Models of User-Contributory Web Sites
Interested in modeling how people view and rate existing content. The talk is an extended example using Digg.
Votes on stories are a combination of visibility (do they see the story) and interest (do they like it enough to vote on it). In this experiment, they don’t have info on visibility, so they need to model it.
Their model captures key Digg qualitative features: slow initial growth, then fast growth as the story gets more views.
A model for promotion of an article is created.
A stochastic process approach is used to connect user and system behaviors. It applies to users with limited information and tasks.
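A toy version of the visibility-times-interest idea: this is my own illustrative simulation, not the authors' model, and all parameters (viewer counts, promotion threshold) are invented. The point is that coupling votes to visibility, and visibility to a promotion threshold, reproduces the slow-then-fast growth shape:

```python
import random

def simulate_story(interest, steps=50, promote_at=10, seed=0):
    """Votes accrue as visibility x interest; promotion to the front page
    boosts visibility, producing the slow-then-fast growth pattern."""
    rng = random.Random(seed)
    votes, history, promoted = 0, [], False
    for _ in range(steps):
        viewers = 500 if promoted else 20  # front page vs. upcoming section
        votes += sum(1 for _ in range(viewers) if rng.random() < interest)
        if votes >= promote_at:
            promoted = True
        history.append(votes)
    return history

history = simulate_story(interest=0.05)
# Growth rate jumps once the story crosses the promotion threshold.
print(history[:5], history[-1])
```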
Personal Information Management vs. Resource Sharing: Towards a Model of Information Behavior in Social Tagging Systems
Why do people tag? Towards a model of tagging as info interaction behavior.
Is tagging a way to get around the vocabulary problem (different communities, different terms)?
Emerging tag models for language (Linguistic Tag Model), function, and tag relationships. Found that almost all tags relate to content, not time, task, or emotion.
TACS – web based tool for tag analysis
Used Amazon Mechanical Turk as a cheap way to get survey subjects, although there may be some problems (verification, biased population, platform)
Assume different motivations for tagging. Organizing your own content (PIM) vs Media and information sharing.
Designed a questionnaire for Delicious, Connotea, Flickr, and YouTube users, using a 7-point Likert scale.
Qualitative analysis showed strong differences in motivation for using different sites.
Ease of tagging not significantly different. Tagging is useful (Connotea users really think so).
Compares to Shneiderman 2002 Two dimensions of social interaction (activity vs. social sphere)
In terms of IR, people thought tags on Flickr/YouTube were more helpful than Delicious/Connotea. I’m surprised by that…I use tags on Delicious to locate information all the time. For me, it’s one of the key features. When I asked the speaker, he said his qualitative/quantitative results had no indication of that type of behavior. I think that’s really interesting. Time for a paper?
Activity Types (Cool & Belkin 2002). May be applicable, but lacks a social dimension.
Motivation, Structure, and Tenure Factors that Impact Online Photo Sharing
Why do people in online communities share? Photos, info, meta-information, code. Want to quantify drivers for sharing and actual behavior. Can look at the area in terms of WHY people share, WHAT they share, and WHERE.
Note a difference between creating and sharing. They are separate, but many studies assume creation is coupled with sharing. Looked at Flickr data; combined survey data with system reported data.
Looks at 3 factors: Motivation (Intrinsic vs Extrinsic, Self vs. others)
Structure: Number of contacts
Tenure: Years since started sharing
Looked at artifact sharing per year tenure.
I wonder why they went shares/year, not per month. Seems like you could really see different outcomes for people that post habitually, vs people that share their one time trip.
Commitment, Number of contacts positively correlated with sharing. Personal Enjoyment is not correlated (maybe because people motivated by creating more than sharing). Self-development is negatively correlated with sharing (maybe because they are more interested in quality than quantity). Time since first upload strongly negatively correlated with sharing (the longer you’re with a community, the less likely you are to share). Maybe because of loss of interest.
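Correlations like these can be computed with a plain Pearson coefficient. A sketch with invented tenure/sharing numbers, chosen to mimic the reported negative relationship between tenure and sharing:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: years of tenure vs. photos shared per year.
tenure = [1, 2, 3, 4, 5, 6]
shares = [120, 90, 70, 50, 30, 20]
print(round(pearson(tenure, shares), 2))  # strongly negative
```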
Modeling Blog Dynamics
The blogosphere is a system of interactions of posts, topics, links, etc. The purpose here is to create a generative model of the blogosphere that matches properties of the real blogosphere for prediction and motivation.
Actually 2 networks combined into one: Blog network and post network.
Goal: Model micro-level interactions to create the macro-level patterns (structure, and dynamic over time) of the blogosphere.
Structure/Topological Patterns: Power Laws (interposting time)
Temporal/Dynamic Patterns: Burstiness and Self-similarity
Proposed Model: ZC
In every timestep, for every blog, assign a state as part of an FSM, depending on how likely they are to blog. If they blog, randomly decide if they will create a link to a neighbor or a ‘random blog’.
This creates a post distribution, burstiness, post popularity similar to real blogosphere.
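A rough sketch of an FSM-driven posting model in this spirit. This is my own simplification, not the authors' ZC model: "linking to a neighbor" is approximated by linking to another blog's latest post, and all probabilities are invented:

```python
import random

def simulate_blogosphere(n_blogs=30, steps=100, p_wake=0.1, p_sleep=0.5,
                         p_neighbor_link=0.7, seed=1):
    """Each blog is a tiny two-state FSM (dormant/active). Active blogs post,
    linking either to another blog's latest post or to a random earlier post."""
    rng = random.Random(seed)
    active = [False] * n_blogs
    posts = []        # list of (blog_id, linked_post_index or None)
    last_post = {}    # most recent post index per blog
    for _ in range(steps):
        for b in range(n_blogs):
            # FSM transition: dormant blogs wake with p_wake; active blogs
            # stay active with 1 - p_sleep, which yields bursty posting runs.
            active[b] = rng.random() < (p_wake if not active[b] else 1 - p_sleep)
            if not active[b]:
                continue
            if posts and rng.random() < p_neighbor_link:
                target = rng.choice(list(last_post.values()))
            else:
                target = rng.randrange(len(posts)) if posts else None
            posts.append((b, target))
            last_post[b] = len(posts) - 1
    return posts

posts = simulate_blogosphere()
print(len(posts), "posts generated")
```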
System Design and Community Culture
The role of rules and algorithms in shaping human behavior
- Lukas Biewald, Dolores Labs
- Rashmi Sinha, Slideshare
- Cameron Marlow, Facebook
Dolores Labs – Making Crowds Efficient and Reliable. They pay people to perform tasks, aka Amazon Mechanical Turk.
Slideshare – Focus on social design. Presentations are fundamentally social; you don’t make them for yourself. The social networking tools (commenting, favoriting, tagging) have led to the creation of a community.
Facebook – Runs the Data Science team, which uses machine learning and research to understand how users use the site, and that leads to design changes.
Examples of Unexpected Community Behavior?
RS: What gets spammed and what does not, particularly in their comment system. They went through lots of iterations.
LB: How a task is prompted affected the outcome. So now they work with people to define tasks.
What sorts of design decisions are based on difficulty?
LB: Try to break a task into the smallest possible unit.
RS: Presentations are less frequent than say photos, so there are different rules. Also differentiate between user types: content creator, readers, aggregators.
CM: Facebook isn’t really designed around a task. They do lots of things to enable use at different levels.
Range of tasks across the three systems. How do you learn how social interactions change tasks?
RS: Observed real life events (people gather around a presentation). Create a unit, and a construct around that.
CM: FB tries to lower the barrier of trying new tasks. For example, someone can upload a photo, others can tag photos, add metadata, etc.
Design by Intuition vs. Design by Data. What is your approach/process in developing new features?
RS: Start with intuition, a primary hypothesis. Look at what data exists in the world. Once it’s up, there’s lots of data to see what people like, what people talk about. Also do A/B testing.
LB: Can nicely segment users along whatever dimension you want, so you have lots of options.
CM: People react to change. Some like it, some hate it. What fraction of the population respond to the change.
We know you can prompt people to get certain types of behavior. How do you compensate for that?
RS: Not so worried about that — doesn’t have to be scientific. Of course, you can also do experiments to deal with it.
CM: There are many sources of bias in these large ecosystems. Important that decision makers know about them.
Community, communicate, share. What makes for successful conversation?
CM: Allow them to happen at a different scale, use aggregated tools to understand entire conversation. For example, they have a tool that can find a term/keyword across all of Facebook, as a percentage of all text. Helps them make sense (in some small way) of everything.
RS: Twitter hashtags are a really good, scalable way to communicate a topic. Well, maybe partially scalable. When a hashtag makes twitter trending topics, bots take over. But things are good up until then.
How do people discover your content, features?
RS: Email, social network links, but mainly Google search
CM: The Wall. Now have two feeds: 1 real time, 1 algorithm driven.
Twitter innovations: #hashtags, @replies, ReTweets – users came up with those. How do you design so that users can extend the design on their own?
RS: Initial version of Slideshare was barebones. Keep the initial design to the core, get feedback, refine. Build new features based on what works. Also, develop an API so people can extend your site.
CM: Design a platform so that people can build their own specific tools.
How do you enable the conversation/feedback between designers and community? How do you differentiate edge case complaints vs real problems.
LB: Designers do customer support
RS: Ditto. Also, use numbers, percentages of people that complain.
CM: Collect as many signals as possible. If something shows up across many areas, it may be a real problem.
A Categorical Model for Discovering Latent Structure in Social Annotations
This paper describes a model to discover the structure of semantic topics over documents using tags.
They propose a community-based categorical annotation model. Communities form around interests, expertise, language, etc. Each community has a number of categories as its world view. Therefore the community draws tags about a document from its list of categories. Use Gibbs Sampling to recover communities and categories. This gives you a distribution of communities and categories.
Used a corpus from flickr and delicious.com to do experiments.
First looked at similarity between content-based topics and tag-based topics. No real similarity.
Propose an example application: Information Access via Topic and Category. Given a particular page, you can see pages that are similar in content, tags, both, or neither.
My question is about a person’s membership in a community. When I look at an object, I’m influenced by the tags…it becomes part of my vocabulary. So really I have a distribution of membership in the different categories. I’m not sure if/how this work takes that into account.
Content Based Recommendation and Summarization in the Blogosphere
Given a set of blogs related to a topic, find a subset of blog feeds to read that have interest in the topic. Previous work is based on link popularity (PageRank, HITS). This one looks at content similarity. It builds a blog post network graph, where directed and weighted edges indicate links from one post to another. Blog importance is defined as the importance of the adjacent blogs. Post similarity is defined by TF-IDF.
This also defines a diversity ranking, and discounts nodes that are too similar to previously selected nodes. It also adds a user-defined quality factor.
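Post similarity via TF-IDF can be sketched in a few lines (tiny toy corpus; a real system would tokenize and weight far more carefully):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weighting: term frequency scaled by inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "gephi network visualization of facebook friends",
    "visualizing facebook friend networks with gephi",
    "recipe for sourdough bread",
]
vecs = tfidf_vectors(docs)
# Related posts score above zero; the unrelated post scores zero.
print(round(cosine(vecs[0], vecs[1]), 2), cosine(vecs[0], vecs[2]))
```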
Experiments used BLOG2006 dataset. Calculate node quality using a Linear Threshold diffusion model. Compared this algorithm to a random selection of blogs, a simple heuristic for selecting blogs, and a greedy algorithm.
Leskovec has a good paper on selecting interesting blogs using a diffusion model. It takes into account the fact that there are a lot of repeated stories/topics — I don’t need to read all of them. I wonder if he could compare this model to that one.
Also, I’m not clear is to whether he’s selecting whole blogs or blog posts. The algorithm is based on blog posts, and that may not be very accurate. For example, the most popular post on this blog (by far) is about Pinax deployment, but I haven’t really written about that much in the last year. Would this entire blog be flagged with this algorithm?
I think there are a lot of techniques for solving this problem. I propose to evaluate them by aggregating the blogs of social media researchers.
Targeting Sentiment Expressions through Supervised Ranking of Linguistic Configurations
Sentiment expressions – single- or multi-word phrases that express evaluation. Assumes binary polarity (positive, negative).
Target – the word/phrase which is the object of evaluation. Sentiment expressions only link to explicit targets.
So, given annotated mentions of sentiment expressions, find the target. Did lots of manual annotation to come up with a gold standard against which they can test the targeting system (I feel for those summer interns).
Baseline approach: proximity. The nearest mention is selected as the target.
Baseline 2: Run a dependency parser.
They use supervised ranking: build independent classifiers for each sentiment expression/possible target pair, and a ranking algorithm to help select between them. Classifiers are trained using RankSVM.
Results: better than the baselines. 80% precision/recall/F-score is the human max; this system is around 70%. This approach definitely beats bag-of-words and proximity.
Duncan Watts knows about Social Networks. Looks at the web as a tool for doing social science research. Observing individual behavior, and interactions recorded in real time, for large populations. Can the web lead to a social science revolution?
This talk will describe four studies, each motivated by a substantive science question, and illustrate use of new technology that would not have been practical a few years ago.
Dynamic Network Analysis
We typically look at static networks. If we look at how networks change over time, we can ask different questions
- What factors affect tie formation/termination?
- How do network properties change?
- How constrained are individual choices?
Used an Email Data network in a university community: 14.5M emails over 9 months, 43K members
Found that Structure Drives Evolution — network proximity overwhelmingly determines new ties. This type of data can be used to build empirical models of network formation. Can also look at network stability of various properties over time. However, found that individual rankings change.
Web-Based “Macro-Sociological” Experiment on Social Influence
Why are cultural “hits” (eg, Harry Potter, Titanic, Michael Jackson) so much more successful than average, yet so difficult to predict? It may have to do with “Social Influence.” If everyone is influencing each other, what effects does it have on the market overall?
You couldn’t really do this type of experiment (comparing individual and market-wide effects) in a lab before the web (physical constraints, can’t go back in time).
Created a web-based “Cultural Market”: Music Lab. Subjects were shown a grid of 48 songs by unknown bands. You can listen to a song and decide to download it. Subjects were randomly assigned to 2 conditions: see download counts (social influence), or no download counts. Also broke the study into 8 different ‘worlds’.
Found social influence at individual level: people are more likely to click on songs in the top 10.
Calculated properties at Collective Level. Inequality of Success (Gini Coefficient), Unpredictability of Success (average difference in market share of songs across R realizations of the world). Found that when people know what other people think, unpredictability goes up, inequality of success higher than control. Also found that ‘best’ songs never do poorly, ‘worst’ songs never do great.
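The Gini coefficient they use for inequality of success is straightforward to compute. A sketch with invented download counts for an "equal" world and a "hit-driven" world:

```python
def gini(values):
    """Gini coefficient: 0 means perfectly equal success, 1 means winner-take-all."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Standard formula using the rank-weighted sum of sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Hypothetical download counts in two experimental 'worlds' of 48 songs.
equal_world = [100] * 48
hit_world = [1000] * 2 + [10] * 46
print(round(gini(equal_world), 2), round(gini(hit_world), 2))
```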
Broader Impacts: Relevant whenever people base decisions on observations of others. Markets do not simply reveal stable underlying preferences. Institutions based on “Representative Individual” models may need to be revised.
Network Survey on Facebook
Some evidence that Americans increasingly group themselves with like minded individuals, some contrary evidence. One hypothesis is that people are not as similar to their friends as they think. This type of study is really hard to put together, execute. Facebook made it easier.
Created the “Friend Sense” application, which asks you what your friend thinks about various questions. Got 2,500 respondents and 12,160 complete dyads. This is a known biased population, BUT a traditional study would have cost $200-300K and taken 2 years. On Facebook, it took $2-3K and a couple of months.
Results: Friends are more similar than strangers, but not as similar as they think. It turns out people are unaware of much of what their friends think.
How Do Financial Incentives Affect Performance
Assumption that performance-based pay should result in better work than fixed pay. This study asks whether an employer can elicit better performance from a given worker pool by paying them more.
This study looked at web-based peer production (wikipedia, Y! Answers, Digg). Used Amazon Mechanical Turk for crowd sourcing labor. Subjects accept a job, receive an up front fee. They are sent to another site, where they are assigned a task, and will receive a bonus for doing well. Randomly assigned to 3 pay levels (low, medium, high).
Results: Subjects do more work for more pay. Also, do less work for harder tasks. However, found that increasing pay did NOT improve accuracy.
Also, found that people always thought they were being underpaid (in post questions, people said they should be paid more).
Tentative suggestion: Payment levels should be dictated by recruitment and retention, not direct impact on quality. (lots of caveats on the results of this study).
Concludes that social network platforms offer social scientists new tools to study social interactions, collective dynamics.
Lots of exciting progress in “network science”
- physics, computer science, sociology, economics
- massive scale
- network experiments
- Large, observational studies
Fundamental advances will require new approaches. Need to study large populations of individuals, plus interactions and behavior over extended time.
Social Science 2.0: Should address macro-level phenomena from a bottom up understanding of micro-level social interactions.
- scale of data, experimentation, and platforms still infeasible for lone researchers
- shared data storage, experimental platforms, subject pools needed
- human subject research on the web = privacy and consent issues
I really liked this discussion because it relates some of the technical discussions at this conference (network analysis, data mining, etc) with larger social issues.