Turkers’ guidelines for academic requesters on Amazon Mechanical Turk

If you’ve spent time talking with Turkers, you probably know that academic requesters have been a continuous source of strain. Research surveys with horrendous pay and arbitrary rejections are common. Despite Mechanical Turk’s attractive availability, a large number of researchers make innocent missteps and cause serious stress. Recently, the tension came to a head on Turkopticon. An IRB-approved researcher experimented on the platform unannounced. The result was Turker confusion, strife, and wasted time, in a system where time is what it takes to make ends meet.

Turkers have had to deal with research problems on a case-by-case basis through e-mail or by calling human subjects review boards (e.g., IRBs, HRPPs) for help. Now, a collective of Turkers and researchers has created guidelines that make Turkers’ expectations and rights available in advance, to mitigate these tensions from the start. They address how to be a good requester, how to pay fairly, and what Turkers can do if HITs are questionable. They apply to Turkers both as experimental subjects and as the data-processing workers who fuel academic research.

We’ll publicly maintain these guidelines so IRBs and researchers can easily find them, and Turkers can easily point to them in advocating for themselves.

Read the guidelines: http://guidelines.wearedynamo.org

The guidelines were developed over several weeks and have been circulated and debated by workers. Turkers have been signing them to show their support.

As a requester, you are part of a very powerful group on AMT. Your signature in support of this document will help give Turkers a sense of cooperation and goodwill, and make Mechanical Turk a better place to work.

Today is Labor Day, a US holiday to honor the achievements of worker organizations. Honor Turkers by signing the guidelines as a researcher, and treating Turkers with the respect they deserve.

If you have any questions, you can email them to info@dynamo.org or submit a reply to this post.

- The Dynamo Collective

The Human Flesh Search: Large-Scale Crowdsourcing for a Decade and Beyond

Human Flesh Search (HFS, 人肉搜索 in Chinese), a Web-enabled large-scale crowdsourcing phenomenon (mostly based on voluntary crowd power without cash rewards), originated in China a decade ago. It is a new form of search and problem-solving scheme that involves collaboration among a potentially large number of voluntary Web users. The term “human flesh,” an unfortunate translation of the Chinese name, refers to human empowerment (“crowd-powered search” would be a more accurate English name). HFS has seen tremendous growth since its inception in 2001 (Figure 1).

Figure 1. (a) Types of HFS episodes, and (b) evolution of HFS episodes based on social desirability

HFS has been a unique Web phenomenon for just over ten years, and it presents a valuable test-bed for scientists to validate existing and new theories in social computing, sociology, the behavioral sciences, and so forth. Based on a comprehensive dataset of HFS episodes collected from participants’ discussions on the Internet, we performed a series of empirical studies focusing on the scope of HFS activities, the patterns of the HFS crowd collaboration process, and the unique characteristics and dynamics of HFS participant networks. More results from the analysis of HFS participant networks can be found in two papers published in 2010 and 2012 (Additional readings 1 and 2).

In this paper, we surveyed HFS participants to gain an in-depth understanding of the HFS community and the factors that motivate participants to contribute. The survey results shed light on HFS participants and, more broadly, on the people who power crowdsourcing systems. Most participants contribute to HFS voluntarily, without expecting monetary rewards (whether real-world or virtual-world money).

The findings indicate great potential for researchers to explore how to design more effective and efficient crowdsourcing systems, and how to better harness the power of the crowd for social good, for solving complex tasks, and even for business purposes such as marketing and management.

For more, see our full paper, The Chinese “Human Flesh” Web: the first decade and beyond (free download link; preprint is also available upon request).

Qingpeng Zhang, City University of Hong Kong

Additional readings:

  1. Wang F-Y, Zeng D, Hendler J A, Zhang Q, et al (2010). A study of the human flesh search engine: Crowd-powered expansion of online knowledge. Computer, 43: 45-53. doi:10.1109/MC.2010.216
  2. Zhang Q, Wang F-Y, Zeng D, Wang T (2012). Understanding crowd-powered search groups: A social network perspective. PLoS ONE 7(6): e39749. doi:10.1371/journal.pone.0039749

Human Computation now accepting submissions

The interdisciplinary journal Human Computation is now accepting
manuscripts (http://hcjournal.org/ojs/HC-flyer.pdf). The editorial board
is inviting high-quality contributions in the field of Human Computation
from all related disciplines. This is an open-access, community-driven
journal with no author or subscriber fees.

Human Computation is an international and interdisciplinary forum for
the electronic publication and print archiving of high-quality scholarly
articles in all areas of human computation, which concerns the design or
analysis of information processing systems in which humans participate
as computational elements.

The journal aims to bring together Human Computation (HC) results and
perspectives from a wide field of HC-related disciplines, such as
Artificial Intelligence, Behavioral Sciences, Citizen Science, Cognitive
Science, Complexity Science, Computer Science, Evolutionary Biology,
Economics, HCI, Philosophy and others. Subtopics may include novel or
transformative applications of HC, interfaces, methods and task design,
result aggregation and selection methods, mechanism design, software and
infrastructure development, user studies related to HC, and others.

We are now accepting submissions.  The first issue will be published in
early Fall of 2014. To be considered for this issue, submissions must be
received by July 31.  More information on the submission guidelines and
process can be found on the journal site: http://hcjournal.org.

Emerging Dynamics in Crowdfunding Campaigns

Recent research has shown that, in addition to the quality and presentation of project ideas, the dynamics of investment during a crowdfunding campaign also play an important role in determining its success. To further understand the role of investment dynamics, we carried out an exploratory analysis: we trained decision-tree predictors over the time series of money pledged to Kickstarter campaigns, to investigate the extent to which simple money inflows and their first-order derivatives can predict a campaign’s eventual success.
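
To make the setup concrete, here is a minimal sketch, not the authors’ exact pipeline: it builds fixed-length features from the first fraction of each campaign’s pledge series and trains a scikit-learn decision tree to predict success. The `campaigns` data structure and the feature construction are assumptions for illustration.

```python
# Minimal sketch: predict campaign success from an early prefix of the pledge time series.
# `campaigns` is assumed to be a list of (daily_pledges, succeeded) pairs (hypothetical).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def prefix_features(daily_pledges, prefix=0.15, n_bins=20):
    """Resample the observed prefix of the pledge series onto a fixed-length grid."""
    series = np.asarray(daily_pledges, dtype=float)
    cutoff = max(1, int(len(series) * prefix))
    observed = series[:cutoff]
    grid = np.linspace(0, 1, n_bins)
    positions = np.linspace(0, 1, len(observed))
    return np.interp(grid, positions, observed)

def evaluate(campaigns, prefix=0.15):
    X = np.array([prefix_features(p, prefix) for p, _ in campaigns])
    y = np.array([succeeded for _, succeeded in campaigns])
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    return cross_val_score(clf, X, y, cv=5).mean()  # mean cross-validated accuracy
```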

Figure 1: Prediction accuracies over time by using the values of money inflows and the selected significant time before current time

The results based on the values of money inflows are shown in Figure 1:

  • As expected, the performance of the predictors steadily improves as more of the campaign is observed.
  • With only the first 15% of the money inflows, our predictor can achieve 84% accuracy.
  • The most “active” periods appear to be around the first 10% of the campaign, as well as between 40% and 60%.

Figure 2: Prediction accuracies over time by using the derivative of money inflows and the selected significant time before current time

The results based on the derivative of money inflows are shown in Figure 2:

  • The performance of the predictors does not increase much until the very last stage.
  • The most important period also does not change until the very end, when it jumps from 5% to 100%.

Based on these results, we draw the following conclusions:

  • The periods around 10% and 40%-60% of a campaign have a stronger impact on its outcome.
  • “Seed money” (the initial 15% of money inflows) may largely determine the final result of a campaign.
  • Don’t give up: a campaign can still succeed at the very end.

For more, please see our full paper, Emerging Dynamics in Crowdfunding Campaigns.

Huaming Rao, Nanjing University of Science & Technology and University of Illinois at Urbana-Champaign
Anbang Xu, University of Illinois at Urbana-Champaign
Xiao Yang, Tsinghua University
Wai-Tat Fu, University of Illinois at Urbana-Champaign

Community-Based Bayesian Aggregation Models for Crowdsourcing

A typical crowdsourcing classification scenario is where we wish to classify a number of items based on a set of noisy or biased labels that were provided by multiple crowd workers with varying levels of expertise, skills and attitudes. To obtain the set of accurate aggregated labels, we must be able to assess the accuracies and biases of each worker who contributed labels. Ultimately, these estimates of the workers’ accuracy should be integrated within the process that infers the items’ true labels.

Prior work on the data aggregation problem in crowdsourcing led to an expressive representation of a worker’s accuracy in the form of a latent worker confusion matrix. This matrix expresses the probability of each possible labelling outcome for a specific worker, conditioned on each possible true label of an item, and thus reflects the labelling behaviour of a given user, who may, for example, be biased towards a particular label range. See the example below for a classification task with three label classes (-1, 0, 1).

[Figure: example confusion matrices of a bad worker and a good worker]

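As a concrete illustration (a toy example of ours, not taken from the paper), a worker’s confusion matrix can be written as a row-stochastic matrix and used to simulate the labels that worker would report:

```python
# Toy illustration of worker confusion matrices for a 3-class task with labels (-1, 0, 1).
# Rows correspond to the true label, columns to the label the worker reports.
import numpy as np

labels = [-1, 0, 1]

good_worker = np.array([    # mostly diagonal: usually reports the true label
    [0.90, 0.08, 0.02],
    [0.05, 0.90, 0.05],
    [0.02, 0.08, 0.90],
])

biased_worker = np.array([  # biased towards label 1, whatever the true label is
    [0.20, 0.20, 0.60],
    [0.10, 0.20, 0.70],
    [0.05, 0.15, 0.80],
])

def sample_label(confusion, true_label, rng):
    """Sample the label a worker would report for an item with the given true label."""
    row = labels.index(true_label)
    return rng.choice(labels, p=confusion[row])

rng = np.random.default_rng(0)
print(sample_label(biased_worker, true_label=-1, rng=rng))  # often 1, despite true label -1
```
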
In CommunityBCC, we make a further modelling step by adding a latent worker-type variable, which we call a community. Communities represent similarity patterns among the workers’ confusion matrices. Thus, we assume that the workers’ confusion matrices are not completely random, but rather that they tend to follow some underlying clustering patterns – such patterns are readily observable by plotting the confusion matrices of workers as learned by BCC. See this example from a dataset with three-point scale labels (-1, 0, 1):

[Figure: clusters of worker confusion matrices learned by BCC]

The CommunityBCC model is designed to encode the assumptions that (i) the crowd is composed of an unknown number of communities, (ii) each worker belongs to one of these communities, and (iii) each worker’s confusion matrix is a noisy copy of their community’s confusion matrix. The factor graph of the model is shown below, and the full generative process is described in the paper (details below).

[Figure: factor graph of the CommunityBCC model]

How to find the number of communities
For a given dataset, we can find the optimal number of communities using standard model selection. In particular, we can perform a model-evidence search over a range of community counts. So, if we assume that the community count lies within a range of 1..x communities, we can run CommunityBCC over this range and compute the model evidence for each community count. This computation can be done efficiently with approximate inference based on message passing. For an example, take a look at computing model evidence for model selection using the Infer.NET probabilistic programming framework here.
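
In pseudocode-like Python, the selection loop could look like the sketch below; `fit_community_bcc` is a hypothetical helper standing in for a run of CommunityBCC inference (e.g., in Infer.NET) that returns the log model evidence for a fixed community count.

```python
# Model-selection sketch: pick the community count with the highest (log) model evidence.
def select_community_count(labels, max_communities=10):
    best_count, best_log_evidence = None, float("-inf")
    for k in range(1, max_communities + 1):
        # Hypothetical helper: fit CommunityBCC with k communities, return log evidence.
        log_evidence = fit_community_bcc(labels, n_communities=k)
        if log_evidence > best_log_evidence:
            best_count, best_log_evidence = k, log_evidence
    return best_count
```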

Evaluation
We tested our CommunityBCC model on four different crowdsourced datasets, and our results show that it provides a number of advantages over BCC, Majority Voting (MV), and Dawid and Skene’s Expectation-Maximization (EM) method.

  • CommunityBCC converges faster to the highest classification accuracy, using fewer labels. See the figure below, where we iteratively select labels for each dataset.
    [Figure: classification accuracy as labels are incrementally selected, for each dataset]
  • The model provides useful information about the number of latent worker communities. See the figure below, showing the communities and the percentage of workers estimated by CommunityBCC in each of the four datasets.
    [Figure: communities and percentage of workers estimated by CommunityBCC in each dataset]

To learn more about Community-Based Bayesian Aggregation Models for Crowdsourcing, take a look at the paper:

Matteo Venanzi, John Guiver, Gabriella Kazai, Pushmeet Kohli, and Milad Shokouhi. Community-Based Bayesian Aggregation Models for Crowdsourcing. In Proceedings of the 23rd International World Wide Web Conference (WWW 2014), best paper runner-up, ACM, April 2014.

Full code for this model
The full C# implementation of this model is described in this post where you can download and try out its Infer.NET code. You are welcome to experiment with the model and provide feedback.

Matteo Venanzi, University of Southampton
John Guiver, Microsoft
Gabriella Kazai, Microsoft
Pushmeet Kohli, Microsoft
Milad Shokouhi, Microsoft


More than Liking and Bookmarking? Towards Understanding Twitter Favouriting Behaviour

Twitter is a widely used micro-blogging platform that offers its users a variety of features to engage with contacts in their social network and the content they produce. One of these features is the favouriting function: a small, star-shaped icon displayed at the bottom of every tweet.

The use of favouriting has increased strongly over the years, but in contrast to other Twitter features, such as retweeting or hashtags, it has not, to date, been the focus of any rigorous scientific investigation.

Our work presents an initial study of favouriting behaviour. In particular, we focus on the motivations people have for favouriting a tweet. We approach this question via a large-scale survey, which queried 606 Twitter users on the frequency with which they exhibit particular behaviours, including how often they make use of the favourite button. Moreover, two free-form questions asked users about the reasons why they use this function and what they hope to achieve when doing so.

Interestingly, only 65% of our respondents (395 participants) reported knowing about the favouriting feature. On the one hand, 26.8% of these participants stated that they never favourite a tweet. On the other hand, 36.1% reported favouriting regularly, and 5% of participants even reported doing so multiple times per day.

The main result of our study is a coding scheme, or classification, of 25 heterogeneous reasons for using the favouriting feature. The table below shows the complete coding scheme along with frequency information, detailing how often each code appeared in the participants’ answers.

[Table: coding scheme of favouriting motivations, with frequency information]

Our findings show that motivations behind favouriting can be grouped into two major use cases:

  • (A) favouriting is used as a response or reaction to the tweet or its metadata, e.g., by liking it [A3]. Another prominent example is the ego favouriter [A4.2], who favourites a tweet when he or she is mentioned in it.
  • (B) favouriting is used for a specific purpose or to fulfil a function, e.g., by bookmarking the tweet [B1] in the favourites list. Another example would be agreeing with the author [B2.1], which can be interpreted as a digital fist bump or nod, as a form of unwritten communication [B2].

All in all, we can see that the favouriting feature is extensively re-purposed, revealing unsupported user needs and interesting behaviour.

For a more detailed explanation of the codes and example statements, see our full paper, More than Liking and Bookmarking? Towards Understanding Twitter Favouriting Behaviour.

Florian Meier, Chair for Information Science, University of Regensburg, Germany
David Elsweiler, Chair for Information Science, University of Regensburg, Germany
Max L. Wilson, Mixed Reality Lab, University of Nottingham, United Kingdom

Social influence in not-so-social media: Linguistic style in online reviews

Language is not only the means through which we express our thoughts and opinions, it also conveys a great deal of social information about ourselves and our relationships to others. Linguistic accommodation is often observed in face-to-face and technology-mediated encounters.

The social identity approach is typically invoked to explain such phenomena: we adjust our language patterns in order to be more in sync with the patterns of others with whom we identify. What happens though, in a social medium that isn’t really all that social? Do we still observe evidence of influence on participants’ linguistic style?

We studied reviewers’ language patterns at TripAdvisor review forums, where there is no direct interaction between participants. We identified several stylistic features that deviate from the medium’s “house style,” in the sense that their use is very rare, for example:

  • Second person voice (only 7% of reviews in our data set incorporate this feature)
  • Emoticons (3%)
  • Markers of British vocabulary (3%)

We examined the hypothesis that reviewers are more likely to incorporate unusual features in their reviews when they are exposed to them in their local context (i.e., the preceding reviews submitted on the same attraction). Our hypothesis was supported for most of the features we examined.

For instance, the figure below shows the probability of a reviewer writing in the second person voice as a function of increasing exposure to this feature. Specifically, the horizontal axis shows the proportion of the 7 immediately preceding reviews manifesting the feature; the vertical axis is the proportion of current reviews incorporating the feature, given the extent of exposure. It is clear that with increasing exposure to the unusual feature, the reviewer is more likely to deviate from the general “house style,” and follow suit with the previous reviews. In fact, beyond a given level of exposure, it becomes almost certain that the current review will also manifest the rare feature.

[Figure: probability of writing in the second person voice as a function of exposure in the 7 preceding reviews]
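
For readers who want to reproduce this kind of analysis on their own data, here is a rough sketch under simplifying assumptions (it is not the authors’ code): it estimates, for each level of exposure in the preceding reviews, the proportion of current reviews that use a rare feature. The `reviews_by_attraction` structure and the `has_feature` predicate (e.g., a second-person-voice detector) are hypothetical.

```python
# Sketch: relate exposure to a rare stylistic feature in preceding reviews to its use
# in the current review. Reviews per attraction are assumed to be in chronological order.
from collections import defaultdict

def exposure_curve(reviews_by_attraction, has_feature, window=7):
    counts = defaultdict(lambda: [0, 0])  # exposure level -> [feature uses, total reviews]
    for reviews in reviews_by_attraction.values():
        flags = [has_feature(r) for r in reviews]
        for i in range(window, len(flags)):
            exposure = sum(flags[i - window:i]) / window  # share of preceding reviews with feature
            counts[exposure][0] += flags[i]
            counts[exposure][1] += 1
    return {level: used / total for level, (used, total) in sorted(counts.items())}
```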

Our paper presents experiments on 12 such linguistic features, and offers preliminary evidence that even in the absence of direct, repeated interaction between social media participants, linguistic accommodation can occur. Thus, herding behaviors in language may come about through the process of reading and writing alone.

Audience design offers a possible explanation for our observations. It may be that, due to the lack of direct interaction at TripAdvisor, participants form a perception of their audience based primarily on the previously contributed reviews, adjusting their writing style accordingly. This explanation resonates with recent work on the particular properties of social media audiences (e.g., the imagined audience and context collapse).

However, further work must tease out the possible influence of external factors, such as attraction-specific or seasonal characteristics. The present work establishes a correlation between local context and the use of linguistic features, but not necessarily a clear-cut causal relationship.

Loizos Michael and Jahna Otterbacher. “Write Like I Write: Herding in the Language of Online Reviews.” In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM), 2014. Available at: http://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8046.

Information Overload in Social Media and its Impact on Social Contagion

Since Alvin Toffler popularized the term “Information overload” in his bestselling 1970 book Future Shock, it has become ubiquitous in modern society. The advent of social media and online social networking has led to a dramatic increase in the amount of information a user is exposed to, greatly increasing the chances of the user experiencing an information overload. Surveys show that two thirds of Twitter users have felt that they receive too many posts, and over half of Twitter users have felt the need for a tool to filter out the irrelevant posts.

Our goal is to quantitatively characterize the phenomenon of information overload and its impact on information propagation in a social network. To this end, we perform a large-scale quantitative study of information overload experienced by users in Twitter. The key insight that enables our study is that users’ information processing behaviors can be reverse engineered through a careful analysis of the times when they receive a piece of information and when they choose to forward it to other users.
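
As a rough illustration of that idea, under simplifying assumptions and not the paper’s exact method, one can estimate each user’s incoming tweet rate and forwarding probability from timestamped records and then inspect how forwarding behaviour changes with the incoming rate. The `received` record format below is hypothetical.

```python
# Sketch: per-user incoming rate (tweets/hour) and forwarding probability, then the average
# forwarding probability per incoming-rate bin. A drop beyond roughly 30 tweets/hour would
# correspond to the threshold reported below.
from collections import defaultdict

def user_stats(received):
    """`received`: iterable of (user, timestamp_hours, was_forwarded) records (hypothetical)."""
    per_user = defaultdict(list)
    for user, t, was_forwarded in received:
        per_user[user].append((t, was_forwarded))
    stats = {}
    for user, events in per_user.items():
        if len(events) < 2:
            continue
        times = sorted(t for t, _ in events)
        incoming_rate = len(events) / max(times[-1] - times[0], 1e-9)  # tweets received per hour
        forward_prob = sum(f for _, f in events) / len(events)         # share of tweets forwarded
        stats[user] = (incoming_rate, forward_prob)
    return stats

def binned_forwarding(stats, bin_width=5):
    bins = defaultdict(list)
    for rate, prob in stats.values():
        bins[int(rate // bin_width)].append(prob)
    return {b * bin_width: sum(p) / len(p) for b, p in sorted(bins.items())}
```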

We found several insights that not only reveal the extent to which users in social media are overloaded with information, but also help us in understanding how information overload influences users’ decisions to forward and disseminate information to other users:

  • We find empirical evidence of a limit on the amount of information a Twitter user produces per day; very few Twitter users produce more than ∼40 tweets/day.
  • We find no limit on the information received by Twitter users; many Twitter users follow several hundred to several thousand other users.
  • We find a threshold rate of incoming information (∼30 tweets/hour), below which the probability that a user forwards a received tweet holds nearly constant, but above which it begins to drop substantially (figure below). We argue that the threshold rate roughly approximates the limit on users’ information-processing capacity and allows us to identify overloaded users.
  • We observe that if a user is overloaded, the higher the rate at which she receives information, the longer the time she takes to process and forward the information. Further, overloaded users tend to prioritize tweets from a subset of sources.

For more details, see our full paper Quantifying Information Overload in Social Media and its Impact on Social Contagions.

Manuel Gomez-Rodriguez, MPI for Intelligent Systems and MPI for Software Systems
Krishna Gummadi, MPI for Software Systems
Bernhard Schölkopf, MPI for Intelligent Systems

Methodological Debate: How much to pay Indian and US citizens on MTurk?

This is a broadcast search request (hopefully of interest to many readers of the blog), not the presentation of research results.

When conducting research on Amazon Mechanical Turk (MTurk), you always face the question of how much to pay workers. You want to be fair, to incentivize diligent work, to expedite recruiting, to sample a somewhat representative cross-section of Turkers, and so on. For the US, I generally aim at $7.50 per hour, slightly more than the US minimum wage (although that is non-binding) and presumably slightly higher than the average wage on MTurk. Now I am planning a cross-cultural study comparing the survey responses and experimental behavior of Turkers registered as residing in India with those of US workers. How much to pay in the US, and how much in India? For the US it is easy: $7.50 * (expected duration of the HIT in minutes / 60). And India?

The two obvious alternatives are

  1. Pay Indian workers the same as US workers: $7.50 per hour. MTurk is a global marketplace in which workers from many nations compete. It’s only fair to pay the same rate for the same work.
  2. Adjust the wage to the national price level: ~$2.50 per hour. A dollar buys more in India than in the US, so paying the same nominal rate leads to higher incentives for Indian workers and might bias sampling, effort, and results. According to The World Bank, the purchasing power parity conversion factor to market exchange rate ratio for India compared to the US is 0.3 (http://data.worldbank.org/indicator/PA.NUS.PPPC.RF); $7.50 in the US would correspond to $2.25 in India. Based on The Economist’s Big Mac index, one could argue for $2.49 in India (raw index) to $4.50 (adjusted index; http://www.economist.com/content/big-mac-index). According to Ashenfelter (2012, http://www.nber.org/papers/w18006), wages in McDonald’s restaurants in India are 6% of the wage at a McDonald’s restaurant in the US, which could translate to paying $0.45 per hour on MTurk. Given the wide range of estimates, $2.50 might be a reasonable value (the quick calculation below summarizes these conversions).
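
For concreteness, here is a back-of-the-envelope sketch of these conversions; the factors are simply the figures quoted above, not authoritative values.

```python
# Back-of-the-envelope pay calculation for a HIT, using the conversion factors quoted above.
US_HOURLY = 7.50

factors = {
    "same nominal rate": 1.00,
    "World Bank PPP factor": 0.30,        # $7.50 * 0.30 = $2.25/hour
    "Big Mac index (raw)": 2.49 / 7.50,   # ~$2.49/hour
    "Big Mac index (adjusted)": 4.50 / 7.50,
    "McDonald's wage ratio": 0.06,        # ~$0.45/hour
}

def pay_per_hit(hit_minutes, factor):
    """Suggested payment for a HIT of the given expected duration, in USD."""
    return round(US_HOURLY * factor * hit_minutes / 60, 2)

for name, factor in factors.items():
    print(f"{name}: ${pay_per_hit(10, factor):.2f} for a 10-minute HIT")
```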

What criteria should be used to decide, and which of the two alternatives is better?

I appreciate any comments and suggestions and hope that these will be valuable to me and to other readers of Follow the Crowd.

CrisisLex: Efficiently Collecting and Filtering Tweets in Crises

Timely access to useful information during crises is critical for those forced to make life-altering decisions. To stay informed, emergency responders and affected individuals increasingly rely on social media platforms, specifically on Twitter. To obtain relevant information, they typically use one of two main strategies to query Twitter:

  • Keyword-based sampling: Track the tweets that contain a set of manually identified keywords or hashtags specific to a crisis, such as #sandy for Hurricane Sandy or #bostonbombings. Yet, keywords are only as responsive as the humans curating them; indeed, in our data such searches returned only a fraction of the relevant tweets: 18% to 45% of the crisis-relevant tweets were retrieved, with an average of ~33%.
  • Geo-based sampling: Track the tweets that are geo-tagged in the area of the disaster. Alas, only a small percentage of the returned tweets are actually about the disaster: 6% to 26% are crisis-relevant, with an average of ~12.5%.

Efficiently collecting crisis-relevant information from Twitter is challenging due to the laconic language of tweets and the limitations of Twitter’s API for accessing tweets in real time (the streaming API). Twitter can be queried by content, through the use of up to 400 keywords, or by geo-location. If both keywords and geo-locations are given, the query is interpreted as a disjunction (logical OR) of the two. This is undesirable, as the public API gives access to only 1% of the data; if the query matches more data than that, it returns a random sample. Thus, as the query becomes broader, at some point we start losing data.
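
As a small practical illustration (our own sketch, not part of CrisisLex), a lexicon longer than 400 terms can be split into batches of at most 400 keywords, each batch forming one track query:

```python
# Sketch: split lexicon terms into batches that respect a 400-keyword-per-query limit.
def batch_keywords(terms, max_per_query=400):
    """Yield lists of at most `max_per_query` terms, one list per query."""
    for start in range(0, len(terms), max_per_query):
        yield terms[start:start + max_per_query]

terms = ["damage", "affected people", "people displaced", "donate blood",
         "text redcross", "stay safe", "crisis deepens", "evacuated", "toll raises"]
for i, batch in enumerate(batch_keywords(terms)):
    print(f"query {i}: track {len(batch)} terms")
```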

To overcome these limitations, we built CrisisLex—a lexicon of terms that frequently appear in tweets posted during a variety of crises. By querying Twitter using CrisisLex, we obtain better trade-offs between how much relevant data we retrieve and how clean that data is. The lexicon contains terms such as:

  • damage
  • affected people
  • people displaced
  • donate blood
  • text redcross
  • stay safe
  • crisis deepens
  • evacuated
  • toll raises

CrisisLex has two main applications:

  • Increase the recall in the sampling of crisis-related messages (particularly at the start of the event), without incurring a significant loss in terms of precision.
  • Automatically learn the terms used to describe a new crisis and adapt the query with them.

Consequently, CrisisLex requires no manual intervention to define or adapt the query. This is particularly useful, as the manual identification of keywords takes time, which, in turn, may result in losing tweets due to latency. In addition, using CrisisLex not only retrieves more comprehensive sets of crisis-relevant tweets, but also helps to preserve the original distribution of message types and message sources.
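
To give a flavour of what automatic adaptation can look like, here is a generic sketch (it is not CrisisLex’s actual term-selection algorithm): candidate terms are scored by how much more frequent they are in tweets already identified as crisis-relevant than in a background sample, and the top scorers are added to the tracked terms.

```python
# Generic query-adaptation sketch: score candidate terms by their relative frequency in
# crisis-relevant tweets versus a background sample, and return the top candidates.
from collections import Counter

def candidate_terms(crisis_tweets, background_tweets, current_terms, top_k=20):
    def term_counts(tweets):
        counts = Counter()
        for text in tweets:
            counts.update(w.lower() for w in text.split() if len(w) > 3)
        return counts

    crisis = term_counts(crisis_tweets)
    background = term_counts(background_tweets)
    total_crisis = sum(crisis.values()) or 1
    total_background = sum(background.values()) or 1
    scores = {
        w: (crisis[w] / total_crisis) / (background[w] / total_background + 1e-9)
        for w in crisis if w not in current_terms
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```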

For more detailed results on how we built and tested CrisisLex, please check our paper: CrisisLex: A Lexicon for Collecting and Filtering Microblogged Communications in Crises. If you want to use CrisisLex to collect tweets, and/or want to build your own lexicon for other domains (e.g., health, politics, sports), please check our code and data (in accordance with the terms of service of Twitter’s API) at CrisisLex.org.

Alexandra Olteanu, École Polytechnique Fédérale de Lausanne
Carlos Castillo, Qatar Computing Research Institute
Fernando Diaz, Microsoft Research
Sarah Vieweg, Qatar Computing Research Institute