Turkers’ guidelines for academic requesters on Amazon Mechanical Turk

If you’ve spent time talking with Turkers, you probably know that academic requesters have been a continuous source of strain. Research surveys with horrendous pay and arbitrary rejections are common. Despite Mechanical Turk’s attractive availability, a large number of researchers make innocent missteps and cause serious stress. Recently, the tension came to a head on Turkopticon. An IRB-approved researcher experimented on the platform unannounced. The result was Turker confusion, strife, and wasted time, in a system where time is what it takes to make ends meet.

Turkers have had to deal with research problems on a case-by-case basis through e-mail or by calling human subjects review boards (e.g. IRBs, HRPPs) for help. Now, a collective of Turkers and researchers has created guidelines making Turkers’ expectations and rights available in advance to mitigate these tensions from the start. They address how to be a good requester, how to pay fairly, and what Turkers can do if HITs are questionable. They apply to Turkers both as experimental subjects and as data-processing workers who fuel academic research.

We’ll publicly maintain these guidelines so IRBs and researchers can easily find them, and Turkers can easily point to them in advocating for themselves.

Read the guidelines: http://guidelines.wearedynamo.org

They were developed over several weeks, and have been circulated and debated by workers. Turkers have been signing them to show their support.

As a requester, you are part of a very powerful group on AMT. Your signature in support of this document will help give Turkers a sense of cooperation and goodwill, and make Mechanical Turk a better place to work.

Today is Labor Day, a US holiday to honor the achievements of worker organizations. Honor Turkers by signing the guidelines as a researcher, and treating Turkers with the respect they deserve.

If you have any questions, you can email them to info@dynamo.org or submit a reply to this post.

- The Dynamo Collective

Emerging Dynamics in Crowdfunding Campaigns


Recent research has shown that, in addition to the quality and presentation of project ideas, the dynamics of investment during a crowdfunding campaign also play an important role in determining its success. To further understand this role, we conducted an exploratory analysis: we trained decision tree predictors on the time series of money pledged to Kickstarter campaigns, to investigate the extent to which simple money inflows and their first-order derivatives can predict the eventual success of campaigns.
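
As a rough illustration of this setup (our sketch, not the paper’s code; the data and feature layout are invented stand-ins), one can train a scikit-learn decision tree on the early portion of each campaign’s pledge series:

```python
# Minimal sketch of the prediction setup, assuming toy stand-in data:
# each row is a campaign's cumulative pledged fraction of its goal,
# sampled at 100 evenly spaced points during the campaign.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_campaigns, n_steps = 500, 100
pledges = np.sort(rng.random((n_campaigns, n_steps)), axis=1)
success = (pledges[:, -1] > 0.8).astype(int)       # toy success label

k = int(n_steps * 0.15)                  # observe only the first 15% of the campaign
X_inflow = pledges[:, :k]                # money-inflow features
X_deriv = np.diff(pledges[:, :k], axis=1)  # first-order derivative features

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
print("inflows:    ", cross_val_score(tree, X_inflow, success, cv=5).mean())
print("derivatives:", cross_val_score(tree, X_deriv, success, cv=5).mean())
```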

Figure 1: Prediction accuracies over time using the values of money inflows and the selected significant times before the current time

The results based on the values of money inflows are shown in Figure 1:

  • As expected, the performance of the predictors steadily improves as more of the campaign is observed.
  • With only the first 15% of the money inflows, our predictor can achieve 84% accuracy.
  • The most “active” periods appear to be around the first 10% of a campaign as well as between 40% and 60%.

Figure 2: Prediction accuracies over time using the derivative of money inflows and the selected significant times before the current time

The results based on the derivative of money inflows are shown in Figure 2:

  • The performance of the predictors does not increase much until the very last stage of a campaign.
  • The most informative period also does not change until the very end, when it jumps from 5% to 100%.

Based on these results, we reach the following conclusions:

  • The periods around 10% and between 40% and 60% of a campaign have a stronger impact on its outcome.
  • “Seed money” (the initial 15% of money inflow) may largely determine the final result of a campaign.
  • Don’t give up: a campaign can still succeed at the very end.

For more, please see our full paper, Emerging Dynamics in Crowdfunding Campaigns.

Huaming Rao, Nanjing University of Science & Technology and University of Illinois at Urbana-Champaign
Anbang Xu, University of Illinois at Urbana-Champaign
Xiao Yang, Tsinghua University
Wai-Tat Fu, University of Illinois at Urbana-Champaign

Community-Based Bayesian Aggregation Models for Crowdsourcing

A typical crowdsourcing classification scenario is one in which we wish to classify a number of items based on a set of noisy or biased labels provided by multiple crowd workers with varying levels of expertise, skills and attitudes. To obtain accurate aggregated labels, we must be able to assess the accuracy and bias of each worker who contributed labels. Ultimately, these estimates of the workers’ accuracy should be integrated within the process that infers the items’ true labels.

Prior work on the data aggregation problem in crowdsourcing led to an expressive representation of a worker’s accuracy in the form of a latent worker confusion matrix. This matrix expresses the probability of each possible labelling outcome for a specific worker, conditioned on each possible true label of an item, and thus reflects the labelling behaviour of a given worker, who may, for example, be biased towards a particular label range. See the example below for a classification task with three label classes (-1, 0, 1).

[Figure: example confusion matrices for a bad worker and a good worker]
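
To make the representation concrete, here is a small illustration of ours (the numbers are invented, not taken from the paper): each row corresponds to a true label, each column to the probability that the worker reports that label, so rows sum to one.

```python
# Illustrative worker confusion matrices for a three-class task with
# labels (-1, 0, 1). Rows = true label, columns = reported label.
import numpy as np

labels = [-1, 0, 1]

# A good worker: probability mass concentrated on the diagonal.
good_worker = np.array([
    [0.90, 0.08, 0.02],
    [0.05, 0.90, 0.05],
    [0.02, 0.08, 0.90],
])

# A biased ("bad") worker: tends to report label 1 whatever the truth is.
bad_worker = np.array([
    [0.20, 0.20, 0.60],
    [0.10, 0.20, 0.70],
    [0.05, 0.15, 0.80],
])

# P(worker reports 1 | true label is -1) for the biased worker:
print(bad_worker[labels.index(-1), labels.index(1)])  # -> 0.6
```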

In CommunityBCC, we take a further modelling step by adding a latent worker type variable, which we call community. Communities represent similarity patterns among the workers’ confusion matrices. Thus, we assume that the workers’ confusion matrices are not completely random, but rather tend to follow some underlying clustering patterns; such patterns are readily observable by plotting the confusion matrices of workers as learned by BCC. See this example from a dataset with three-point scale labels (-1, 0, 1):

[Figure: clusters of worker confusion matrices learned by BCC]

The CommunityBCC model is designed to encode the assumptions that (i) the crowd is composed of an unknown number of communities, (ii) each worker belongs to one of these communities and (iii) each worker’s confusion matrix is a noisy copy of their community’s confusion matrix. The factor graph of the model is shown below, and the full generative process is described in the paper (details below).

[Figure: factor graph of the CommunityBCC model]

How to find the number of communities
For a given dataset, we can find the optimal number of communities using standard model selection. In particular, we can perform a model-evidence search over a range of community counts: if we assume that the community count lies within a range of 1..x, we can run CommunityBCC over this range and compute the model evidence of each community count. This computation can be done efficiently using approximate inference based on message passing. For an example, take a look at the Infer.NET probabilistic programming framework’s example of computing model evidence for model selection.
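
To make the search concrete, here is a schematic Python sketch of ours (the actual implementation is in Infer.NET, linked above); `run_inference` is a hypothetical callable standing in for message-passing inference that returns the log model evidence.

```python
# Schematic sketch of evidence-based model selection (not the authors' code).
# run_inference(labels, n_communities) is a hypothetical stand-in that runs
# approximate message-passing inference and returns the log model evidence.
def select_community_count(run_inference, labels, max_communities):
    evidences = {c: run_inference(labels, c)
                 for c in range(1, max_communities + 1)}
    best = max(evidences, key=evidences.get)  # count with the highest evidence
    return best, evidences
```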

Evaluation
We tested our CommunityBCC model on four different crowdsourced datasets, and our results show that it provides a number of advantages over BCC, Majority Voting (MV) and Dawid and Skene’s Expectation Maximization (EM) method.

  • CommunityBCC converges faster to the highest classification accuracy using fewer labels. See the figure below, where we iteratively select labels for each dataset.
    [Figure: classification accuracy vs. number of selected labels for the four datasets]
  • The model provides useful information about the number of latent worker communities. See the figure below showing the communities, and the percentage of workers in each, estimated by CommunityBCC for each of the four datasets.
    [Figure: estimated communities and percentage of workers per community in the four datasets]

To learn more about Community-Based Bayesian Aggregation Models for Crowdsourcing, take a look at the paper:

Matteo Venanzi, John Guiver, Gabriella Kazai, Pushmeet Kohli, and Milad Shokouhi, Community-Based Bayesian Aggregation Models for Crowdsourcing, in Proceedings of the 23rd International World Wide Web Conference, WWW2014, Best paper runner up, ACM, April 2014

Full code for this model
The full C# implementation of this model is described in this post, where you can download and try out its Infer.NET code. You are welcome to experiment with the model and provide feedback.

Matteo Venanzi, University of Southampton
John Guiver, Microsoft
Gabriella Kazai, Microsoft
Pushmeet Kohli, Microsoft
Milad Shokouhi, Microsoft


Social influence in not-so-social media: Linguistic style in online reviews

Language is not only the means through which we express our thoughts and opinions, it also conveys a great deal of social information about ourselves and our relationships to others. Linguistic accommodation is often observed in face-to-face and technology-mediated encounters.

The social identity approach is typically invoked to explain such phenomena: we adjust our language patterns in order to be more in sync with the patterns of others with whom we identify. What happens, though, in a social medium that isn’t really all that social? Do we still observe evidence of influence on participants’ linguistic style?

We studied reviewers’ language patterns on TripAdvisor review forums, where there is no direct interaction between participants. We identified several stylistic features that deviate from the medium’s “house style,” in the sense that their use is very rare. For example:

  • Second person voice (only 7% of reviews in our data set incorporate this feature)
  • Emoticons (3%)
  • Markers of British vocabulary (3%)

We examined the hypothesis that reviewers are more likely to incorporate unusual features in their reviews when they are exposed to them in their local context (i.e., the preceding reviews submitted on the same attraction). Our hypothesis was supported for most of the features we examined.

For instance, the figure below shows the probability of a reviewer writing in the second person voice as a function of increasing exposure to this feature. Specifically, the horizontal axis shows the proportion of the 7 immediately preceding reviews manifesting the feature; the vertical axis is the proportion of current reviews incorporating the feature, given the extent of exposure. It is clear that with increasing exposure to the unusual feature, the reviewer is more likely to deviate from the general “house style,” and follow suit with the previous reviews. In fact, beyond a given level of exposure, it becomes almost certain that the current review will also manifest the rare feature.

[Figure: probability of second-person voice in the current review vs. proportion of the 7 preceding reviews using it]
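
A minimal sketch of how such an exposure curve can be computed (our reconstruction; the column names are assumptions and the data is a toy stand-in, only the 7-review window comes from the description above):

```python
# For each review, measure the fraction of the 7 immediately preceding reviews
# of the same attraction that used the rare feature, then estimate how often
# the current review uses it at each exposure level.
import pandas as pd

# Reviews are assumed sorted by submission time within each attraction;
# 'has_feature' marks e.g. second-person voice (toy values below).
reviews = pd.DataFrame({
    "attraction": ["a"] * 10 + ["b"] * 10,
    "has_feature": [0, 0, 1, 0, 1, 1, 0, 1, 1, 1,
                    0, 0, 0, 0, 1, 0, 0, 1, 0, 1],
})

reviews["exposure"] = (
    reviews.groupby("attraction")["has_feature"]
    .transform(lambda s: s.shift(1).rolling(7, min_periods=7).mean())
)

# Probability the current review uses the feature, by level of exposure.
print(reviews.dropna().groupby("exposure")["has_feature"].mean())
```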

Our paper presents experiments on 12 such linguistic features, and offers preliminary evidence that even in the absence of direct, repeated interaction between social media participants, linguistic accommodation can occur. Thus, herding behaviors in language may come about through the process of reading and writing alone.

Audience design offers a possible explanation for our observations. It may be that, due to the lack of direct interaction on TripAdvisor, participants form a perception of their audience based primarily on the previously contributed reviews, adjusting their writing style accordingly. This explanation resonates with recent work on the particular properties of social media audiences (e.g., the imagined audience and context collapse).

However, further work must tease out the possible influence of external factors, such as attraction-specific or seasonal characteristics. The present work establishes a correlation between local context and the use of linguistic features, but not necessarily a clear-cut causal relationship.

Michael, Loizos, and Otterbacher, Jahna. “Write Like I Write: Herding in the Language of Online Reviews.” International AAAI Conference on Weblogs and Social Media, 2014. Available at: http://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8046.

Methodological Debate: How much to pay Indian and US citizens on MTurk?

This is a broadcast search request (hopefully of interest to many readers of the blog), not the presentation of research results.

When conducting research on Amazon Mechanical Turk (MTurk), you always face the question of how much to pay workers. You want to be fair, to incentivize diligent work, to expedite recruiting, to sample a somewhat representative cross-section of Turkers, etc. For the US, I generally aim at $7.50 per hour, slightly more than the minimum wage in the US (although that is non-binding) and presumably slightly higher than the average wage on MTurk. Now I aim for a cross-cultural study comparing survey responses and experiment behavior of Turkers registered as residing in India with US workers. How much to pay in the US, and how much in India? For the US it is easy: $7.50 * (expected duration of the HIT in minutes / 60). And India?

The two obvious alternatives are

  1. Pay the same for Indian workers as US workers: $7.50 per hour. MTurk is a global market place in which workers from many nations compete. It’s only fair to pay the same rate for the same work.
  2. Adjust the wage to the national price level: ~$2.50 per hour. A dollar buys more in India than in the US. Paying the same nominal rate leads to higher incentives for Indian workers and might bias sampling, effort, and results. According to the World Bank, the purchasing power parity conversion factor to market exchange ratio for India compared to the US is 0.3 (http://data.worldbank.org/indicator/PA.NUS.PPPC.RF), so $7.50 in the US would correspond to $2.25 in India. Based on The Economist’s Big Mac index, one could argue for anything from $2.49 in India (raw index) to $4.50 (adjusted index; http://www.economist.com/content/big-mac-index). According to Ashenfelter (2012, http://www.nber.org/papers/w18006), wages in McDonald’s restaurants in India are 6% of the wage at a McDonald’s restaurant in the US, which could translate to paying $0.45 per hour on MTurk. Given the wide range of estimates, $2.50 might be a reasonable value. (The arithmetic is sketched just below this list.)
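
For readers who want to plug in their own numbers, here is the conversion arithmetic from the alternatives above as a small script (the factors are the ones quoted in this post, not authoritative values; the HIT duration is a hypothetical example):

```python
# Worked conversion arithmetic for the India/US wage question.
US_HOURLY = 7.50

factors = {
    "PPP conversion factor (World Bank)": 0.30,        # -> $2.25
    "Big Mac index, raw":                 2.49 / 7.50, # -> $2.49
    "Big Mac index, adjusted":            4.50 / 7.50, # -> $4.50
    "McDonald's relative wage":           0.06,        # -> $0.45
}
for name, f in factors.items():
    print(f"{name:35s} -> ${US_HOURLY * f:.2f} per hour in India")

# Pricing a single HIT from an hourly target, as in the post:
minutes = 12  # expected HIT duration (hypothetical)
print(f"US HIT price for {minutes} min: ${US_HOURLY * minutes / 60:.2f}")
```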

What should the criteria be for deciding, and which of these two alternatives is better?

I appreciate any comments and suggestions and hope that these will be valuable to me and to other readers of Follow the Crowd.

Improving recommendation by directing the crowd’s attention

We are drowning in content. On YouTube alone, over 100 hours of video are uploaded every minute. Which of them are worth watching? Which of the thousands of news stories and discussions on Reddit are worth reading? Which Kickstarter projects are worth funding? To identify quality items, content providers aggregate the opinions of many, for example by asking people to recommend interesting items, and prominently feature highly-rated content. In practice, however, peer recommendation often creates “winner-take-all” and “irrational herding” behaviors, with inconsistent, biased and unpredictable outcomes in which items of similar quality end up with wildly different ratings.

Researchers from the USC Information Sciences Institute and the Institute for Molecular Manufacturing demonstrated that it is possible to overcome these limitations and improve the ability of crowds to identify interesting content. Due to human cognitive biases, people pay far more attention to items appearing at the top of a web page than to those in lower positions. Hence, the presentation order strongly affects how people allocate attention to the available content. Using Amazon Mechanical Turk, the researchers demonstrated that they can manipulate the crowd’s attention through the presentation order of items to improve peer recommendation. Specifically, the common strategy of ordering items by ratings does not accurately estimate their quality, since small early differences in ratings become amplified as people focus attention on the same set of highest-rated items. This “rich-get-richer” effect occurs even when the ratings are not explicitly shown, but are simply used to order the items.

In contrast, ordering items by the recency of rating, much like a Twitter stream with the most recently retweeted posts at the top, leads to more robust estimates of their underlying quality and also produces less variable, more predictable outcomes. Ordering items by recency of rating is also a good choice for time-critical domains, where novelty is a factor, since continuously moving items to the top of the list can rapidly bring newer items to the crowd’s attention.
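
To see why the two orderings behave so differently, here is a toy simulation of ours (not the authors’ experiment): items of identical quality are shown to a stream of visitors whose attention decays with list position, under rating order versus recency order.

```python
# Toy position-bias simulation: equal-quality items accumulate very unequal
# ratings under rating order, and much more even ratings under recency order.
import numpy as np

rng = np.random.default_rng(1)
n_items, n_visitors = 50, 5000
attention = 0.9 ** np.arange(n_items)  # attention decays with list position
attention /= attention.sum()

def simulate(order_by_rating):
    ratings = np.zeros(n_items)
    last_rated = np.full(n_items, -1.0)
    for t in range(n_visitors):
        keys = ratings if order_by_rating else last_rated
        order = np.argsort(-keys)                       # display order, top first
        viewed = order[rng.random(n_items) < attention]  # positions actually seen
        ratings[viewed] += 1    # identical quality: every viewed item gets rated
        last_rated[viewed] = t
    return ratings

for label, flag in [("rating order ", True), ("recency order", False)]:
    r = simulate(flag)
    print(f"{label}: ratings spread from {r.min():.0f} to {r.max():.0f}")
```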


By judiciously exposing information about the preferences of others, for example, by changing the presentation order, content providers can better leverage the “wisdom of crowds” to accelerate the discovery of quality content.

Lerman K, Hogg T (2014) Leveraging Position Bias to Improve Peer Recommendation. PLoS ONE 9(6): e98914. doi:10.1371/journal.pone.0098914

Kristina Lerman, USC Information Sciences Institute

Tad Hogg, Institute for Molecular Manufacturing


The good, the nerd or the beautiful: who should I choose to work with me?

During our lives, we perform collaborative tasks in a wide and diverse range of activities, such as selecting students to participate in a school project, hiring employees for a company, or picking players for a friendly football match.

Given this context, we ask: what factors influence such decisions, i.e., what factors determine whether someone is selected or avoided for a given collaborative task?


Without much thought, one could answer this fundamental question by saying that a person’s skill at the task determines whether he or she will be selected for a collaboration. Although we agree that proficiency definitely plays an important role in the decision, we ask again: is proficiency the only determinant factor? If not, is it even the main factor?

In a careful and rather particular experiment conducted in a classroom of undergraduate students, we combined data from an offline questionnaire with Facebook data to reveal a number of interesting and sometimes surprising findings:

  • the most skilled students were not always preferred;

  • a number of social features extracted from Facebook (see the table below), such as the strength of a friendship, the popularity of the individual on Facebook, whether she is extroverted, and her similarity with other students, are more informative than grades in determining the willingness of students to work together.

[Table: social features extracted from Facebook]

Our findings show:

  • the importance of building up a wide and diverse personal profile when the aim is to be selected for a given collaborative task;

  • that online social network data can indicate whether two individuals would like to work together or not and, as is well known, social chemistry is desirable for achieving the maximum performance of a team;

  • a potential to leverage several online applications, such as team and collaboration recommendation systems that highlight potentially fruitful collaborations and avoid pairing people whose relationship is likely to be conflictual.

Douglas D. Castilho, Universidade Federal de Minas Gerais, Brazil

Pedro O.S. Vaz de Melo, Universidade Federal de Minas Gerais, Brazil

Daniele Quercia, Yahoo! Labs, Barcelona

Fabricio Benevenuto, Universidade Federal de Minas Gerais, Brazil

Early adopters of Twitter and Google+: Validation of a theoretical model of early adopter personality and social network site influence

The widespread adoption of social media is transforming the consumer-brand relationship. Social media allows consumers to connect with other users, and to create, consume and control access to content (Hoffman and Novak, 2012). Research suggests that social media increases brand relationship depth and loyalty, and generates incremental purchase behaviour (Laroche et al., 2012; Kim and Ko, 2012; Pooja et al., 2012). It is not surprising, therefore, that commentators suggest that marketers should target social media users who are more likely to exert an influence on their network in order to facilitate brand recommendations (Iyengar, Han, & Gupta, 2009). But who are these influentials? Goldenberg et al. (2009) suggest that there are only two types of influentials that impact information diffusion: innovators and followers.


Our study looks at early users (in Goldenberg et al.’s terminology, innovators) of two social networking sites, Twitter and Google+, and the effects of personality and mode of information sharing on social influence scoring. Specifically, we look at:

1. How does (i) extraversion, (ii) openness and (iii) conscientiousness influence:

  • Information sharing behaviour
  • Rumour sharing behavior

2. How does (i) information sharing behaviour and (ii) rumour sharing behaviour impact social network site influence scores?

Early Twitter users were identified through a public list and through the joining date listed on users’ public profiles. As the study occurred during the Google+ closed field test, all Google+ users were deemed early users. Two discrete survey instruments were designed, one for Twitter and one for Google+, to provide for different SNS validation checks. To assess the personality traits of respondents, we tested extraversion, openness and conscientiousness with the scale of Gosling et al. (2003), while the information and rumour sharing scales were extracted from Marett and Joshi (2009). The SNS influence score was the dependent variable in our model, measured using two commercial SNS influence score providers, PeerIndex and Klout.

Our study hypothesized that Extraversion and Openness are two personality traits that should positively influence both Information and Rumor sharing behavior (H1 and H2), while Conscientiousness would have opposite effects on Information (+) and Rumor (-) sharing behavior (H3 and H4). We also hypothesized that both Information and Rumor sharing behavior should positively influence social network influence scoring. A structural equation model built in AMOS was used to test these hypotheses.

Results of the Structural Equation Model: Standardised Regression Weights and Summary Findings

The model suggests:

  • Early users of social network sites who are more extroverted, more open, or more conscientious are more likely to share information.
  • Information sharing and rumor sharing should be treated as two distinct constructs in the discussion of social network influence.
  • All three traits were negatively related to rumor sharing. Only the effects of extroversion and conscientiousness were significant.
  • Both information sharing and rumor sharing impacted positively and significantly on social network site influence scores.

While previous literature has suggested that it is difficult to identify market mavens (Goldsmith et al., 2006), early users of social media can be identified easily and conveniently. This may provide firms with the opportunity to target potential innovators and early adopters much more efficiently and thus accelerate the diffusion of marketing messages. Our study suggests that filtering these adopters by messaging behaviour may also be of assistance, with a greater emphasis of resources placed on those social network users who share information rather than rumor. While identifying these potential influencers would seem to be more efficient than identifying mavens, further research is required to understand the most effective way and time to engage with them. Finally, it would seem social network influence scores provide useful signals for identifying social media users likely to share information. Social media users characterised by a combination of high influence scores and a propensity for information sharing are powerful assets for firms, particularly if they have relatively large social networks. Engaging with these influencers represents a relatively low-cost mechanism for indirectly reaching target markets through word of mouth on social networks.

The research was conducted by Dr Theo Lynn (DCU Business School), Dr Laurent Muzellec (UCD), Dr Barbara Caemerrer (ESSCA), Prof. Darach Turley (DCU Business School) and Bettina Wuerdinger (DCU Business School).

A More Paradoxical Paradox

Have you ever checked your Facebook and Instagram and felt that your friends have more interesting lives? You’re not alone! In fact, that’s one of the consequences of the Friendship Paradox, which states that, on average, your friends have more friends than you do. Recently, researchers demonstrated that network paradoxes hold not only for popularity but for other traits as well, such as activity and the virality of content received.


A variety of paradoxes exist in online social networks such as Twitter and Facebook: your friends, on average, have more friends, are more active, and post more popular/interesting content than you do. Image source: https://flic.kr/p/5QXd9M

We recently showed that the standard version of the paradox, using the mean of friends’ values of the trait, arises trivially from the properties of statistical sampling from a heavy-tailed distribution. Social traits, such as popularity or activity (e.g., number of posts made), often have a “heavy tail”, where extremely large values, e.g., very popular people, appear much more frequently than expected under a normal distribution. When sampling randomly from such a distribution, the mean of the sample (i.e., the mean of friends’ values) will grow with sample size, resulting in the paradox. In contrast, the median of the sample does not behave this way and is a more robust measure of the paradox.
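
Here is a quick numerical illustration of that sampling argument (our toy demo, with an assumed Pareto trait distribution, not the paper’s data):

```python
# Draw "friend" samples from a heavy-tailed popularity distribution: the
# typical sample mean grows with sample size, while the median stays put.
import numpy as np

rng = np.random.default_rng(2)
popularity = rng.pareto(1.1, size=1_000_000) + 1  # heavy-tailed trait values

for k in [1, 10, 100, 10_000]:                    # number of "friends" sampled
    samples = rng.choice(popularity, size=(2000, k))
    print(f"k={k:6d}: typical sample mean = {np.median(samples.mean(axis=1)):7.2f}, "
          f"typical sample median = {np.median(np.median(samples, axis=1)):5.2f}")
```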

Surprisingly, the paradoxes persist when the median is used: i.e., most of your friends (and followers) have more friends (followers) than you do, and also post and receive more viral and diverse content than you do. In other words, the paradox holds not only for the mean, where a single very popular (or active) friend could skew the average, but also for most friends.

Why do strong paradoxes exist in networks? Since they are not a consequence of sampling, they must have a behavioral origin. We hypothesize that they arise due to correlations between an individual’s traits and popularity, or between the traits of connected people (homophily). To test this hypothesis, we performed a shuffle test: we kept the network topology fixed, but permuted traits between nodes in the network. This keeps the distribution of the traits intact, but destroys correlations between people. As expected, we still observe a paradox for the mean in the shuffled network, but not the strong paradox that uses the median.
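
A schematic version of the shuffle test (ours, on a synthetic scale-free network; networkx is used for brevity) looks like this:

```python
# Shuffle test sketch: fix the topology, permute traits across nodes, and
# recompute how many nodes see a mean/median paradox among their friends.
import numpy as np
import networkx as nx

rng = np.random.default_rng(3)
g = nx.barabasi_albert_graph(3000, 3)       # heavy-tailed degree distribution
trait = np.array([d for _, d in g.degree()], dtype=float)  # trait tied to degree

def paradox_share(traits, agg):
    """Fraction of nodes whose friends' aggregated trait exceeds their own."""
    return np.mean([agg([traits[f] for f in g[n]]) > traits[n] for n in g])

for name, t in [("original", trait), ("shuffled", rng.permutation(trait))]:
    print(f"{name}: mean paradox {paradox_share(t, np.mean):.2f}, "
          f"strong (median) paradox {paradox_share(t, np.median):.2f}")
```

Shuffling preserves the trait distribution but breaks trait-degree correlations, so the median-based (strong) paradox collapses while the mean-based paradox survives, matching the behavior described above.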

In short, the main findings of our work are:

  • We found “strong” paradoxes where most of your friends have more friends than you do, etc.
  • We showed that the paradoxes have a behavioral origin and are not simply the result of statistical properties of sampling from the network.
  • The origin of the paradoxes lies in the correlations between nodes’ traits and their degree, or in homophily.

For details, please see our paper “Network Weirdness: Exploring the Origins of Network Paradoxes” http://arxiv.org/abs/1403.7242

Farshad Kooti, University of Southern California
Nathan O. Hodas, USC Information Sciences Institute
Kristina Lerman, USC Information Sciences Institute

Critical Mass of What?

It is often said that online communities need a critical mass before they become valuable and sustainable. But do they need a mass of people, or a mass of content?

We compared the relative importance of people in a community to the content in the community by looking at the growth of 1,069 WikiProjects over the first 5 years of their existence.

Effect of 1 year growth in membership contributions and membership on 5 year growth of content

We found that the number of people who join a project in its first year predicts the number of contributions that have been made after 5 years. But when we control for the number of people in a community, the number of contributions made in the first year does not affect 5 year growth.
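
A hedged sketch of that “control for” comparison (toy data and assumed variable names, not the paper’s dataset): regress 5-year contributions on first-year membership and first-year contributions together, and see which coefficient survives.

```python
# Toy regression mirroring the comparison above: long-term content growth is
# driven by first-year members; first-year contributions add nothing once
# membership is controlled for.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 1069
members_y1 = rng.poisson(20, n).astype(float)
contribs_y1 = members_y1 * 5 + rng.normal(0, 10, n)    # contributions track members
contribs_y5 = members_y1 * 40 + rng.normal(0, 80, n)   # growth driven by people

df = pd.DataFrame({"members_y1": members_y1,
                   "contribs_y1": contribs_y1,
                   "contribs_y5": contribs_y5})
fit = smf.ols("contribs_y5 ~ members_y1 + contribs_y1", data=df).fit()
print(fit.params)    # members_y1 carries the effect...
print(fit.pvalues)   # ...contribs_y1 is not significant once members are included
```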

So critical mass in online communities means a critical mass of people. Communities will grow larger in the long term by attracting new members rather than encouraging existing members to participate more.

One complication with this is that participation in online communities is usually unevenly distributed among members, following a power-law or “80-20” rule where most content is contributed by a small percentage of community members. Can communities build critical mass around a large number of people who only make small contributions, or do they need to find more “power-users” who will be highly active? Is it the power-users that constitute the critical mass?

Method for classifying power and non-power users

We classified members of each WikiProject as either a power-user or non-power-user and compared the number of contributions made by power-users in the first year to those made by non-power-users (a toy sketch of one such classification follows the list below). We found that:

  • More contribution from non-power-users early in a project’s life leads to better long-term growth
  • More contribution from power-users leads to slower long-term growth
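
As a toy illustration of the classification step mentioned above (the paper defines its own criterion; here we simply call the top decile of first-year contributors “power-users”):

```python
# One plausible power-user split: members above the 90th percentile of
# first-year contribution counts (toy data, not the WikiProject dataset).
import numpy as np

contribs_y1 = np.array([1, 1, 2, 2, 3, 4, 6, 9, 35, 120])  # edits per member
threshold = np.quantile(contribs_y1, 0.9)
is_power_user = contribs_y1 >= threshold

print("power-users:    ", contribs_y1[is_power_user])
print("non-power-users:", contribs_y1[~is_power_user])
print("share of content from power-users:",
      contribs_y1[is_power_user].sum() / contribs_y1.sum())
```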

When power-users do too much early on, they may crowd out potential contributors and community members. Getting more people “in the door” in a community, even if they only make minimal contributions, makes the community more valuable and more productive in the long term.

Online communities that seek growth should design their sites, policies, and incentives to encourage as many individuals as possible to join and make even minimal contributions.

For more, see our full paper, Critical Mass of What? Exploring Community Growth in WikiProjects.
Jacob Solomon, Michigan State University
Rick Wash, Michigan State University