CrowdCamp Report: Benchmarking the Crowd

As crowdsourcing evolves, the crowds are evolving too.  Mechanical Turk is a different population than it was a few years ago.  There are different crowds at different times of day.  And different crowds, whether on CrowdFlower, on MobileWorks, or even a crowd of employees within a company or students within a school, may be better or worse suited to a given application.

In particular, how can researchers and developers cooperate to collect aggregate data about system properties (e.g., latency, throughput, noise), demographics (gender, age, socioeconomic level), and human performance (motor, perceptual, attentional) for the various crowds that they use?

[Image: an example Census benchmarking task embedded in a HIT]

We started exploring this question in a weekend CrowdCamp hackathon at CSCW 2013.  Some concrete steps and discoveries included:

  • We gathered 25 datasets from a wide variety of experiments on Mechanical Turk by many different researchers, ranging from 2008 to 2013.  We found 30,000 unique workers in our sample, and in the most recent datasets, between 20% and 40% of workers had also contributed to previous datasets.  So at least on MTurk, the crowd is stable enough for benchmarking between researchers to be a viable idea.
  • We prototyped a deployment platform, Census, which injects a small benchmarking task into any researcher’s existing HIT using only one line of Javascript (a sketch of what that line might look like follows this list).  The image above shows an example Census task in action.
  • We trawled the recent research literature for possible benchmarking tasks, including affect detection, image tagging, and word sense disambiguation.
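For a sense of what that one-line integration might look like, here is a minimal sketch; the script URL is a placeholder, not Census's actual endpoint.

    // Hypothetical sketch of the one-line integration (placeholder URL, not
    // Census's actual endpoint): loading census.js injects a small benchmarking
    // task into the requester's existing HIT page.
    document.head.appendChild(Object.assign(document.createElement('script'), { src: 'https://census.example.org/census.js' }));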

We also discovered that Mechanical Turk worker IDs are not as anonymous as researchers generally assume.  For benchmarking that shares information among researchers, it will be necessary to take additional steps to protect worker privacy while preserving the ability to connect the same workers across studies.
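One possible mitigation, offered here only as a sketch and not as a protocol we have adopted, is to share a keyed hash of each worker ID instead of the raw ID, so that datasets can still be joined on the same worker without exposing the underlying identifier. In Node-style Javascript this might look like:

    // Illustrative only: pseudonymize worker IDs with an HMAC under a secret key
    // shared among collaborating researchers. Datasets hashed with the same key
    // can still be joined on the same worker, but the raw ID is not exposed.
    const crypto = require('crypto');

    function pseudonymize(workerId, secretKey) {
      return crypto.createHmac('sha256', secretKey).update(workerId).digest('hex');
    }

    // Example using the worker ID shown later in this post:
    console.log(pseudonymize('A3IZSXSSGW80FN', 'shared-research-secret'));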

Saeideh Bakshi, Georgia Tech
Michael Bernstein, Stanford University
Jeff Bigham, University of Rochester
Jessica Hullman, University of Michigan
Juho Kim, MIT CSAIL
Walter Lasecki, University of Rochester
Matt Lease, University of Texas Austin
Rob Miller, MIT CSAIL
Tanushree Mitra, Georgia Tech

 

Mechanical Turk Workers Are Not Anonymous

Users of Amazon Mechanical Turk generally believe that the workers are anonymous, identified only by a long obscure identifier like A3IZSXSSGW80FN. (That’s mine.) A worker’s name or contact information can’t be discovered unless the worker chooses to provide it.

But it isn’t true. Many MTurk workers are rather easy to identify.

Take a typical worker ID. If you’ve ever used MTurk yourself, you can find your own worker ID on your Dashboard, on the far right:

[Screenshot: the MTurk Dashboard, with your Worker ID shown at the far right]

Just search the web for your worker ID, and you may find a surprising number of results:

  • wish lists
  • book reviews
  • tagged Amazon products

In fact, many workers have a public Amazon profile page containing their real name, and sometimes even a photo, at http://www.amazon.com/gp/pdp/profile/workerID. Here’s my profile:

[Image: my Amazon public profile, showing my real name]

In preliminary testing with published datasets containing turker IDs, about 50% of worker IDs we tried had a public profile page, and about 30% of IDs had a discoverable real name.  A smaller percentage had a photo as well.
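For concreteness, a check along these lines might look like the following sketch. This is not the script we used, and the status-code test is only a rough heuristic; it assumes Node 18+ with the built-in fetch.

    // Rough illustration (not our actual analysis script): for each worker ID,
    // request the public profile URL pattern mentioned above and treat a
    // successful response as evidence that a public profile page exists.
    const workerIds = ['A3IZSXSSGW80FN'];  // example ID from this post

    async function hasPublicProfile(id) {
      const res = await fetch(`http://www.amazon.com/gp/pdp/profile/${id}`);
      return res.ok;  // a 200 response suggests a public profile page
    }

    (async () => {
      for (const id of workerIds) {
        console.log(id, (await hasPublicProfile(id)) ? 'public profile found' : 'no public profile');
      }
    })();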

The fundamental problem is that Amazon uses the same identifier for a worker’s account across all of its properties. Every interaction with Amazon is tagged with that identifier, and many of those interactions produce public pages containing it, which are then indexed by search engines.

We discovered this fact at a CrowdCamp workshop last weekend at the CSCW 2013 conference, and it came as a stunning surprise to a room full of researchers with years of experience using Mechanical Turk.

The implications are sweeping:

* For academic researchers: worker IDs may have to be treated as personally identifiable information. For example, publishing worker IDs online in public data sets may be a violation of worker privacy, and counter to the requirements of the researcher’s institutional review board.

* For workers: if you want to protect your online identity and retain anonymity on MTurk, you should register a separate Amazon account expressly for MTurk, rather than using the same account you use for Amazon purchases. But note that if you already have an MTurk account, creating a new one means losing any reputation you’ve built up.

* For Amazon itself: this is a privacy hole that needs addressing. The best solution would be to use distinct identifiers for MTurk and other Amazon properties, even if they share the same login account. At a minimum, however, workers and requesters should be made aware of this privacy risk, and workers whose accounts are publicly identifiable should be permitted to create new ones without loss of reputation.

For more detail, see our working paper, Mechanical Turk Is Not Anonymous, which will be posted by March 7.

Saeideh Bakshi, Georgia Tech
Michael Bernstein, Stanford University
Jeff Bigham, University of Rochester
Jessica Hullman, University of Michigan
Juho Kim, MIT CSAIL
Walter Lasecki, University of Rochester
Matt Lease, University of Texas Austin
Rob Miller, MIT CSAIL
Tanushree Mitra, Georgia Tech

 

Social Proof, Graph Perception, and the Wisdom of the Crowd

People love looking at visualizations of data, prompting the creation of online systems like IBM’s Many Eyes and Data360 where groups of web users can gather to create, analyze, and discuss graphs. Organizations and collective knowledge at large benefit from the insights generated by groups as they collaboratively analyze socially relevant data. Yet what seems like a win-win situation falls short when negative effects of group thinking, like social influence, diminish the quality of the collective signal. Asch [1] first showed this in a well-known experiment in which subjects were asked to match the length of a given line to one of three other lines of different lengths. When confederates answered before the subject and each picked the wrong line, the subject more often than not chose the wrong line as well.

In Asch’s work, the subject’s reliance on the erroneous group response is thought to be based on a desire to fit in with the group. It is also possible, though, that a person will rely on social information in even a simple task out of a desire to be accurate. We applied the concept of social proof, in which a person looks to others’ behavior when deciding whether to engage in an activity, to a graph perception task in which subjects were motivated to be correct. We set up a series of proportion judgment and linear association estimation tasks on Amazon’s Mechanical Turk, and asked subjects to supply their best judgments with and without access to social information about other workers’ responses. We were interested in how the final group response would be affected by the presence of a social signal. Prior responses for a given graph task were shown as a histogram whose peak was set to the group’s actual response when no social information was shown. To see how the quality of the social signal affected the new group response in our social conditions, we also created a set of histograms whose peak was offset by one standard deviation from the actual non-social group response. Subjects in our social condition saw a mix of both types of social histograms: some with a peak at the “faithful,” near-accurate response, and others with a peak at the “offset,” biased response.
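As a rough sketch of how the two histogram conditions relate (the actual stimuli and bin construction are described in the paper; the numbers and bin width below are made up for illustration), the offset histogram can be thought of as the faithful distribution of prior responses shifted by one standard deviation:

    // Illustrative sketch, not the study code: build a "faithful" histogram of
    // prior responses and an "offset" version shifted by one standard deviation.
    function mean(xs) { return xs.reduce((a, b) => a + b, 0) / xs.length; }
    function stdDev(xs) {
      const m = mean(xs);
      return Math.sqrt(mean(xs.map(x => (x - m) ** 2)));
    }

    // Bin responses into fixed-width bins; returns { binStart: count }.
    function histogram(responses, binWidth = 5) {
      const counts = {};
      for (const r of responses) {
        const bin = Math.floor(r / binWidth) * binWidth;
        counts[bin] = (counts[bin] || 0) + 1;
      }
      return counts;
    }

    const priorResponses = [42, 45, 47, 44, 46, 50, 43];  // hypothetical proportion judgments
    const faithful = histogram(priorResponses);
    const offset = histogram(priorResponses.map(r => r + stdDev(priorResponses)));  // peak moved one SD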

Our results showed that the lowest mean errors for a given graph task occurred when Turkers saw the faithful social histogram that closely approximated the true value for the task. The mean errors for the group who saw no social information were only slightly higher. But those who saw the more biased offset social histogram, with a peak one standard deviation from the non-social group response, made significantly more errors than either of the other conditions. In other words, the quality of the crowd’s response for the task depends on how accurately the social signal they are shown captures the truth.

This prompted a second question: how much does the amount of prior social information shown affect whether a new user relies on the signal? We re-ran both social conditions for the same graphs, but this time systematically varied the number of responses shown in the histogram. One might assume that fewer prior responses make for a less “trustworthy” social signal, so new users would rely less on the social information in their own judgments. Yet our analysis showed that it didn’t matter how many responses were shown. In other words, a social signal based on as few as 1 or 5 prior responses was as good in the eyes of our subjects as one based on 50. The implication is that a dynamic like an information cascade can take hold when social signals are in place, with initial responses propagating across a community as new users weight the social answer over and above their private judgment.

In combination, our findings raise some challenging questions for the design of crowdsourcing systems for visual analytics. If social information can lead to less accurate group decisions, should it be shown at all? Given our observation that the “faithful” social histogram centered on the non-social group response led to slightly lower errors than seeing no social information at all, it may be possible for social information to improve the group response under certain conditions. This possibility may well extend to other online systems where social information is displayed. Should such systems instead withhold social information until a large enough number of responses has been gathered? What happens when systematic biases, shared human tendencies to err in the same direction (such as over- or underestimating visualized quantities in a visual analytics context), cause the group response to be inaccurate regardless? Are there ways that system designers can intelligently combine responses to get accurate collective signals, combining what we know about these social dynamics with knowledge of the systematic biases affecting how people interpret visual and other information? These are just a few of the questions we are now considering.

You can read the full details of this research in our CHI 2011 publication:
Hullman, J., Adar, E., and Shah, P. 2011. The impact of social information on visual judgments. In Proceedings of CHI ’11. ACM, New York, NY, USA, 1461–1470.

You can download the paper here.

References:

[1] Asch, S.E. Effects of group pressure upon the modification and distortion of judgment. In Groups, Leadership and Men, H. Guetzkow (Ed.). Pittsburgh, PA, 1951.

About the author: Jessica Hullman (jessica.hullman@gmail.com, http://jhullman.people.si.umich.edu) is a PhD student at the University of Michigan. Her research looks at the challenges to the quality of collective insight that arise when non-expert users gather to analyze visualizations, and how those challenges might be overcome by design interventions at the graph or system level.