The Human Flesh Search: Large-Scale Crowdsourcing for a Decade and Beyond

Human Flesh Search (HFS, 人肉搜索 in Chinese), a Web-enabled, large-scale crowdsourcing phenomenon mostly based on voluntary crowd power without cash rewards, originated in China a decade ago. It is a new form of search and problem solving that involves collaboration among a potentially large number of voluntary Web users. The term "human flesh," an unfortunate translation of its Chinese name, refers to the human power behind the search; "crowd-powered search" is a more appropriate English name. HFS has seen tremendous growth since its inception in 2001 (Figure 1).

Figure 1. (a) Types of HFS episodes, and (b) evolution of HFS episodes based on social desirability.

Having been a unique Web phenomenon for just over 10 years, HFS presents a valuable test-bed for scientists to validate existing and new theories in social computing, sociology, behavioral sciences, and so forth. Based on a comprehensive dataset of HFS episodes collected from participants' discussions on the Internet, we performed a series of empirical studies focusing on the scope of HFS activities, the patterns of the HFS crowd collaboration process, and the unique characteristics and dynamics of HFS participant networks. More results from the analysis of HFS participant networks can be found in two papers published in 2010 and 2012 (Additional readings 1 and 2).

In this paper, we conducted a survey of HFS participants to provide an in-depth understanding of the HFS community and of the factors that motivate these participants to contribute. The survey results shed light on HFS participants and, more broadly, on the people involved in crowdsourcing systems. Most participants contribute to HFS voluntarily, without expecting monetary rewards (real-world or virtual). The findings indicate great potential for researchers to explore how to design more effective and efficient crowdsourcing systems, and how to better harness the power of the crowd for social good, for complex problem solving, and even for business purposes such as marketing and management.

For more, see our full paper, The Chinese "Human Flesh" Web: the first decade and beyond (free download link; a preprint is also available upon request).

Qingpeng Zhang, City University of Hong Kong

Additional readings:

  1. Wang F-Y, Zeng D, Hendler J A, Zhang Q, et al (2010). A study of the human flesh search engine: Crowd-powered expansion of online knowledge. Computer, 43: 45-53. doi:10.1109/MC.2010.216
  2. Zhang Q, Wang F-Y, Zeng D, Wang T (2012). Understanding crowd-powered search groups: A social network perspective. PLoS ONE 7(6): e39749. doi:10.1371/journal.pone.0039749

Methodological Debate: How much to pay Indian and US citizens on MTurk?

This is a broadcast search request (hopefully of interest to many readers of the blog), not the presentation of research results.

When conducting research on Amazon Mechanical Turk (MTurk), you always face the question of how much to pay workers. You want to be fair, to incentivize diligent work, to expedite recruiting, to sample a somewhat representative cross-section of Turkers, and so on. For the US, I generally aim at $7.50 per hour, slightly more than the US minimum wage (although that is non-binding) and presumably slightly higher than the average wage on MTurk. Now I am planning a cross-cultural study comparing survey responses and experiment behavior of Turkers registered as residing in India with US workers. How much should I pay in the US, and how much in India? For the US it is easy: $7.50 * (expected duration of the HIT in minutes / 60). And India?

The two obvious alternatives are

  1. Pay the same for Indian workers as US workers: $7.50 per hour. MTurk is a global market place in which workers from many nations compete. It’s only fair to pay the same rate for the same work.
  2. Adjust the wage to the national price level: ~$2.50 per hour. A dollar buys more in India than in the US, so paying the same nominal rate leads to higher incentives for Indian workers and might bias sampling, effort, and results. According to the World Bank, the ratio of the purchasing power parity conversion factor to the market exchange rate for India is 0.3 (http://data.worldbank.org/indicator/PA.NUS.PPPC.RF), so $7.50 in the US would correspond to $2.25 in India. Based on The Economist's Big Mac index, one could argue for $2.49 (raw index) to $4.50 (adjusted index; http://www.economist.com/content/big-mac-index). According to Ashenfelter (2012, http://www.nber.org/papers/w18006), wages in McDonald's restaurants in India are 6% of the wage at a McDonald's restaurant in the US, which could translate to paying $0.45 per hour on MTurk. Given the wide range of estimates, $2.50 might be a reasonable value; a back-of-the-envelope calculation is sketched below.
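For concreteness, here is a short sketch of the arithmetic behind these estimates. The conversion factors are simply the published figures cited above; the HIT length is a made-up example.

```python
# Back-of-the-envelope HIT pricing under different US-to-India conversions.
def hit_reward(hourly_wage_usd, expected_minutes):
    """Reward to post for a single HIT, given a target hourly wage."""
    return round(hourly_wage_usd * expected_minutes / 60, 2)

US_HOURLY = 7.50
PPP_FACTOR_INDIA = 0.3       # World Bank: PPP conversion factor / market exchange rate
BIGMAC_RAW, BIGMAC_ADJ = 2.49, 4.50  # The Economist's Big Mac index estimates (USD/hour)
MCWAGE_RATIO_INDIA = 0.06    # Ashenfelter (2012): Indian McWage as a share of the US McWage

expected_minutes = 12        # hypothetical HIT length
print("US reward:        ", hit_reward(US_HOURLY, expected_minutes))
print("India (PPP):      ", hit_reward(US_HOURLY * PPP_FACTOR_INDIA, expected_minutes))
print("India (Big Mac):  ", hit_reward(BIGMAC_RAW, expected_minutes), "-", hit_reward(BIGMAC_ADJ, expected_minutes))
print("India (McWage):   ", hit_reward(US_HOURLY * MCWAGE_RATIO_INDIA, expected_minutes))
```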

What should be the criteria to decide and which of these two is better?

I appreciate any comments and suggestions and hope that these will be valuable to me and to other readers of Follow the Crowd.

What do your food & drink habits tell about your culture?

Traditional ways to study cross-cultural differences rely on surveys, which are costly and do not scale. We present another way to obtain similar data that could revolutionize the study of global culture.

We propose the use of publicly available data from location-based social networks (LBSNs) to map individual preferences. This is interesting because an LBSN check-in expresses a user's preference for a certain type of place. LBSNs also have the advantage of being accessible (almost) anywhere by anyone, which solves the scalability problem and allows data to be collected from the entire world at a much lower cost than traditional surveys.

Users expressing their preferences in LBSNs.

Our goal is to propose a new methodology for identifying cultural boundaries and similarities across populations using data collected from LBSNs. Since food and drink habits are known to capture strong differences among people, we use Foursquare check-ins at food and drink venues to represent user preferences for specific types of food and drink. We studied how these preferences change with the time of day and with geographic location. We have found that:

  • The eating and drinking choices in different countries, cities, or neighborhoods of a city reveal fascinating insights into differing habits of human beings. For instance, preferences among people in cities located in the same country tend to be very similar;
  • The times at which check-ins are performed at food and drink places also provide valuable insights into the cultural aspects of a particular region. For example, whereas Americans and English people tend to have their main meal at dinner time, Brazilians have it at lunch time.

Given those observations, we consider spatio-temporal dimensions of food and drink check-ins as users’ cultural preferences. We then apply a simple clustering technique to show the “cultural distance” between countries, cities or even regions within a city. We found that:

  • Our results often strongly agree with common knowledge;

  • Comparing our results with the World Values Survey (a very large study based on many years of survey data), the similarities are striking.

Clustering cities.

Yet, unlike traditional survey-based empirical studies such as that one, our methodology allows cultural dynamics to be identified much faster, capturing current cultural expressions in nearly real time and at a much lower cost.
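To make the clustering step described above concrete, here is a minimal sketch under simple assumptions: each city is represented by the fraction of its food and drink check-ins per (venue category, time slot) pair, and the resulting vectors are clustered. The category names, time slots, and counts are illustrative placeholders, not our actual Foursquare dataset.

```python
# Sketch: cluster cities by their food/drink check-in profiles.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

categories = ["fast_food", "coffee", "bar", "restaurant"]
time_slots = ["morning", "lunch", "afternoon", "dinner", "night"]
cities = ["CityA", "CityB", "CityC", "CityD"]

# Rows: cities; columns: raw check-in counts per (category, time slot) pair.
counts = np.random.default_rng(0).integers(0, 500,
                                           size=(len(cities), len(categories) * len(time_slots)))

features = normalize(counts.astype(float), norm="l1")   # per-city check-in fractions
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

for city, label in zip(cities, labels):
    print(city, "-> cluster", label)
```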

For more, see our full paper, You are What you Eat (and Drink): Identifying Cultural Boundaries by Analyzing Food & Drink Habits in Foursquare.

Thiago H Silva, Universidade Federal de Minas Gerais, Brazil
Pedro O S Vaz de Melo, Universidade Federal de Minas Gerais, Brazil
Jussara M Almeida, Universidade Federal de Minas Gerais, Brazil
Mirco Musolesi, University of Birmingham, UK
Antonio A F Loureiro, Universidade Federal de Minas Gerais, Brazil

For On-Demand Workers, It’s All About the Story

From mystery shopping to furniture assembly, apps such as TaskRabbit and Gigwalk leverage the power of distributed, mobile workers who complete physical world tasks instantly and beyond the constraints of traditional office workspaces. We refer to these workers as the “on-demand mobile workforce.” Mobile workforce services allow task requesters to “crowdsource” tasks in the physical world and aim to disrupt the very nature of employment and work (for good and bad; this may be a matter for another post).

Our paper describes an on-demand workforce service categorization based on two dimensions: (1) task location and (2) task complexity (see figure below). Based on marketplace reviews, user testimonies, and informal observations of the services, we placed four main workforce services into the quadrants to exemplify the categorization.

Categorization of on-demand workforce services.

Although a long line of research on incentives and motivations for crowdsourcing exists, especially on platforms like Amazon's Mechanical Turk, there hasn't been much work on physical crowdsourcing, despite the recent appearance of many such platforms. We conducted interviews with mobile workforce members (see the paper here for the complete methods and findings) to learn more about the extrinsic and intrinsic factors that influence the selection and completion of physical world tasks.

To mention a couple of findings, certain task characteristics were highly important to workers as they selected and accepted tasks:

Knowing the person
Because physical world tasks introduce a different set of personal risks than virtual world tasks (e.g., physical harm, deception), workers creatively investigated requesters and scrutinized profile photos, email addresses, and task descriptions. Profile photos helped workers know whom to expect on-site, and email addresses were used to cross-reference information on social networking sites.

Knowing the “story”
Tasks that listed their intended purpose or background story appealed to the mobile workforce. Tasks such as helping with an anniversary surprise or verifying the condition of a grave plot through a photo shaped workers' opinions and influenced future task selections. Workers also appreciated non-financial incentives, such as the unique experiences that occurred as byproducts of task completion (e.g., meeting new people). Tasks with questionable or unethical intentions (e.g., mailing in old phones, posting fake reviews online, writing student papers) were less likely to be fulfilled.

Generally, this study has broader implications for the design of effective, practical, novel, and well-reasoned social and technical crowdsourcing applications that organize help and support in the physical world. In particular, we hope our findings inform the future development of mobile workforce services whose incentives are not strictly monetary.

Want to learn more? Check out our full paper here at CSCW 2014.

Rannie Teodoro
Pinar Ozturk
Mor Naaman
Winter Mason
Janne Lindqvist

Remote Shopping Advice: Crowdsourcing In-Store Purchase Decisions

Recent Pew reports, as well as our own survey, have found that consumers shopping in brick-and-mortar stores are increasingly using their mobile phones to contact others while they shop. The increasing capabilities of smartphones, combined with the emergence of powerful social platforms like social networking sites and crowd labor marketplaces, offer new opportunities for turning solitary in-store shopping into a rich social experience.

We conducted a study to explore the potential of friendsourcing and paid crowdsourcing to enhance in-store shopping. Participants selected and tried on three outfits at a Seattle-area Eddie Bauer store; we created a single, composite image showing the three potential purchases side-by-side. Participants then posted the image to Facebook, asking their friends for feedback on which outfit to purchase; we also posted the image to Amazon's Mechanical Turk service and asked up to 20 U.S.-based Turkers to identify their favorite outfit, provide comments explaining their choice, and provide basic demographic information (gender, age).

Study participants posted composite photos showing their three purchase possibilities; these photos were then posted to Facebook and Mechanical Turk to crowdsource the shopping decision.

Although none of our participants had used paid crowdsourcing before, and all were doubtful that it would be useful to them when we described our plans at the start of the study session, the shopping feedback provided by paid crowd workers turned out to be surprisingly compelling to participants, more so than the friendsourced feedback from Facebook. This was partly because the crowd workers were more honest, explaining not only what looked good, but also what looked bad, and why. Participants also enjoyed seeing how opinions varied across demographic groups (e.g., did male raters prefer a different outfit than female raters?).

Although Mechanical Turk had a speed advantage over Facebook, both sources generally provided multiple responses within a few minutes – fast enough that a shopper could get real-time decision-support information from the crowd while still in the store.

Our CSCW 2014 paper on “Remote Shopping Advice” describes our study in more detail, as well as how our findings can be applied toward designing next-generation social shopping experiences.

For more, see our full paper, Remote Shopping Advice: Enhancing In-Store Shopping with Social Technologies.

Meredith Ringel Morris, Microsoft Research
Kori Inkpen, Microsoft Research
Gina Venolia, Microsoft Research

Voyant: Generating Structured Feedback on Visual Designs Using a Crowd of Non-Experts

Crowdsourcing offers an emerging opportunity for users to receive rapid feedback on their designs. A critical challenge for generating feedback via crowdsourcing is to identify what type of feedback is desirable to the user, yet can be generated by non-experts. We created Voyant, a system that leverages a non-expert crowd to generate perception-oriented feedback from a selected audience as part of the design workflow.

The system generates five types of feedback: (i) Elements are the individual elements that can be seen in a design. (ii) First Notice refers to the visual order in which elements are first noticed in the design. (iii) Impressions are the perceptions formed in one’s mind upon first viewing the design. (iv) Goals refer to how well the design is perceived to meet its communicative goals. (v) Guidelines refer to how well the design is perceived to meet known guidelines in the domain.

Voyant decomposes feedback generation into a description phase and an interpretation phase, inspired by how critique is taught in design education. In each phase, the tasks focus a worker's attention on specific aspects of a design rather than soliciting holistic evaluations, which improves outcomes. The system submits these tasks to an online labor market (Amazon Mechanical Turk). Each type of feedback typically takes a few hours to generate and costs a few US dollars.
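To make the two-phase decomposition concrete, here is a minimal sketch under simple assumptions. The post_task helper and the question wordings are hypothetical stand-ins for submitting microtasks to a labor market, not Voyant's actual task templates.

```python
# Sketch of a description phase followed by an interpretation phase.
def post_task(question, design_image, context=None):
    # Stand-in for submitting one narrow microtask to a labor market such as
    # Mechanical Turk and collecting the workers' answers.
    print(f"[task on {design_image}] {question}")
    return ["<worker answers>"]

def generate_feedback(design_image, goals):
    # Phase 1 (description): focus workers on what is visible, not on judgments.
    elements = post_task("List the individual elements you can see.", design_image)
    first_notice = post_task("In what order did you first notice the elements?",
                             design_image, context=elements)
    # Phase 2 (interpretation): perceptions and goal fit, built on the descriptions.
    impressions = post_task("What impressions does the design give you at first view?",
                            design_image)
    goal_fit = [post_task(f"How well does the design communicate: {g}?",
                          design_image, context=elements) for g in goals]
    return {"elements": elements, "first_notice": first_notice,
            "impressions": impressions, "goals": goal_fit}

feedback = generate_feedback("poster.png", ["This poster is about Shakespeare"])
```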

Our evaluation shows that users were able to leverage the feedback generated by Voyant to develop insight and discover previously unknown problems with their designs. For example, consider the Impressions feedback Voyant generated on one user's poster: the user intended the poster to be perceived as being about Shakespeare, but was surprised to learn of an unintended interpretation (note the "dog" in the word cloud).

To use Voyant, the user imports a design image and configures the crowd demographics. Once generated, the feedback can be utilized to help iterate toward an effective solution.

Try it: http://www.crowdfeedback.me

 

For more, see our full paper, Voyant: Generating Structured Feedback on Visual Designs Using a Crowd of Non-Experts.
Anbang Xu, University of Illinois at Urbana-Champaign
Shih-Wen Huang, University of Illinois at Urbana-Champaign
Brian P. Bailey, University of Illinois at Urbana-Champaign

CrowdCamp Report: HelloCrowd, The “Hello World!” of human computation

The first program a new computer programmer writes in any new programming language is the “Hello world!” program – a single line of code that prints “Hello world!” to the screen.

We ask, by analogy, what should be the first “program” a new user of crowdsourcing or human computation writes?  “HelloCrowd!” is our answer.

The simplest possible "human computation program".

Crowdsourcing and human computation are becoming ever more popular tools for answering questions, collecting data, and providing human judgment. At the same time, there is a disconnect between interest and ability: potential new users of these powerful tools don't know how to get started. Not everyone wants to take a graduate course in crowdsourcing just to get their first results. To address this, we set out to build an interactive tutorial that teaches the fundamentals of crowdsourcing.

After creating an account, HelloCrowd tutorial users get their feet wet by posting three simple tasks to the crowd platform of their choice. In addition to the "Hello, World" task above, we chose two common crowdsourcing tasks: image labeling and information retrieval from the web. In the first task, workers provide a label for an image of a fruit, and in the second, workers must find the phone number of a restaurant. These tasks can be reused and posted to any crowd platform you like; we provide simple instructions for some common platforms. The interactive tutorial auto-generates the task URLs for each tutorial user and each platform.
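For readers who want to see what posting the "Hello Crowd!" task looks like programmatically, here is a minimal sketch using the boto3 MTurk client against the requester sandbox. The reward, durations, and HTML form are illustrative values only, not the tutorial's generated task.

```python
# Sketch: post a "Hello Crowd!" HIT to the MTurk requester sandbox with boto3.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

question_xml = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <form action="https://www.mturk.com/mturk/externalSubmit" method="post">
      <p>Please type the phrase: Hello world!</p>
      <input type="text" name="greeting"/>
      <!-- a real form must fill assignmentId from the URL query string -->
      <input type="hidden" name="assignmentId" value=""/>
      <input type="submit"/>
    </form>
  ]]></HTMLContent>
  <FrameHeight>200</FrameHeight>
</HTMLQuestion>
"""

hit = mturk.create_hit(
    Title="Hello Crowd!",
    Description="Type a single short phrase.",
    Keywords="tutorial, hello world",
    Reward="0.05",                  # illustrative reward, in USD
    MaxAssignments=3,               # a little redundancy
    LifetimeInSeconds=3600,
    AssignmentDurationInSeconds=300,
    Question=question_xml,
)
print("HITId:", hit["HIT"]["HITId"])
```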

Mmm, crowdsourcing is delicious.

More than just another tutorial on "how to post tasks to MTurk," our goal with HelloCrowd is to teach fundamental concepts. After posting tasks, new crowdsourcers learn how to interpret their results (and how to get even better results next time). For example: what concepts might a new crowdsourcer learn from the results of the "hello world" task or the business phone number task? Phone numbers are simple, right? What about "867-5309" vs. "555.867.5309" vs. "+1 (555) 867 5309"? Our goal is to get new users of these tools up to speed on how to get good results: form validation (or not), redundancy, task instructions, etc.
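As a small illustration of the kind of cleanup this implies (not part of the HelloCrowd tutorial itself), a few lines of normalization go a long way, and they also expose which answers need follow-up:

```python
# Sketch: normalize free-text US phone numbers collected from workers.
import re

def normalize_us_phone(raw):
    digits = re.sub(r"\D", "", raw)           # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                    # drop the US country code
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return None                                # 7-digit or malformed answers need follow-up

for answer in ["867-5309", "555.867.5309", "+1 (555) 867 5309"]:
    print(answer, "->", normalize_us_phone(answer))
```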

In addition to teaching new crowdsourcers how to crowdsource, our tutorial system will collect a longitudinal, cross-platform dataset of crowd responses. Each person who completes the tutorial will have "their" set of worker responses to the standard tasks, and these are all added to a public dataset that will be available for future research on timing, speed, accuracy, and cost.

We’re very proud of HelloCrowd, and hope you’ll consider giving our tutorial a try.

Christian M. Adriano, Donald Bren School, University of California, Irvine
Juho Kim, MIT CSAIL
Anand Kulkarni, MobileWorks
Andy Schriner, University of Cincinnati
Paul Zachary, Department of Political Science, University of California, San Diego

Can we achieve reliable inference using unreliable crowd workers?

Let us assume a set of N crowd workers is given the task of classifying a dog image into one of M possible breeds. Since workers may not be canine experts, they may not be able to perform the fine-grained classification directly, so we should ask simpler questions. Two basic properties of crowd workers degrade the performance of crowdsourcing systems:

  • Lack of domain expertise (which may necessitate asking binary questions rather than asking for fine classification), and
  • Unreliability (which may necessitate intelligently deployed redundancy)

The above problems can be handled by the use of error-correcting codes. Using code matrices, we can design binary questions for crowd workers that allow the task manager to reliably infer the correct class even with unreliable workers.


The performance of a classification task is heavily dependent on the design of these simple binary questions. The question design problem is equivalent to designing an M x N binary code matrix A = {a_li}, where the rows correspond to the different classes and a column a_i corresponds to the question posed to the i-th worker. As an example, consider the task of classifying a dog image into one of four breeds: Pekingese, Mastiff, Maltese, or Saluki. The binary question of whether a dog has a snub nose or a long nose differentiates {Pekingese, Mastiff} from {Maltese, Saluki}, whereas the binary question of whether the dog is small or large differentiates {Pekingese, Maltese} from {Mastiff, Saluki}.


An illustrative example is shown in the figure above for the dog breed classification task. Let the columns corresponding to the i-th and j-th workers be a_i = [1 0 1 0]' and a_j = [1 1 0 0]', respectively. The i-th worker is asked "Is the dog small or large?", since she is to differentiate the first (Pekingese) and third (Maltese) breeds from the others. The j-th worker is asked "Does the dog have a snub nose or a long nose?", since she is to differentiate the first two breeds (Pekingese, Mastiff) from the others. These questions are designed from the code matrix using a taxonomy of dog breeds. The task manager makes the final classification decision by choosing the hypothesis corresponding to the code word (row) closest in Hamming distance to the received vector of decisions. A good code matrix can be designed using simulated annealing or optimization based on cyclic column replacement.
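A minimal sketch of the Hamming-distance decoding step, using just the two example columns above (illustrative only; the code matrices studied in the paper are larger and optimized):

```python
# Sketch: decode a vector of binary worker answers against a code matrix.
import numpy as np

breeds = ["Pekingese", "Mastiff", "Maltese", "Saluki"]
# column 0: "Is the dog small (1) or large (0)?"
# column 1: "Does the dog have a snub nose (1) or a long nose (0)?"
A = np.array([[1, 1],   # Pekingese
              [0, 1],   # Mastiff
              [1, 0],   # Maltese
              [0, 0]])  # Saluki

def decode(answers):
    """answers: binary vector of worker decisions, one entry per column of A."""
    hamming = (A != np.asarray(answers)).sum(axis=1)
    return breeds[int(np.argmin(hamming))]  # row closest in Hamming distance

print(decode([1, 1]))   # small + snub nose  -> Pekingese
print(decode([0, 0]))   # large + long nose  -> Saluki
```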

To evaluate the performance of this scheme, a worker's reliability can be modeled as a random variable, using either the spammer-hammer model or the Beta model. The average probability of misclassification can then be derived as a function of the mean (μ) of the workers' reliability, and the proposed scheme's performance can be compared with the traditional voting-based scheme. The results can be summarized as follows:

  • Crowd Ordering: Better crowds yield better performance in terms of average error probability
  • Coding is better than majority vote: Good codes perform better than majority vote as they diversify the binary questions and use human cognitive energy more efficiently
  • The gap in performance generally increases with larger system size

For more, see our ICASSP 2013 paper, Reliable Classification by Unreliable Crowds.
Aditya Vempaty, Syracuse University
Lav R. Varshney,  IBM Thomas J. Watson Research Center
Pramod K. Varshney, Syracuse University

Truthful Incentives in Crowdsourcing Tasks using Regret Minimization Mechanisms

Monetary incentives in Crowdsourcing platforms

Designing the right incentive structure and pricing policies for workers is a central component of online crowdsourcing platforms such as Mechanical Turk.

  • The job requester’s goal is to maximize the utility derived from the task under a limited budget.
  • Workers’ goal is to maximize their individual profit by deciding which tasks to perform and at what price.

Yet current crowdsourcing platforms offer requesters only limited capability for designing pricing policies, and tasks are often priced using rules of thumb. This limitation can result in inefficient use of the requester's budget or in workers losing interest in the task.

Price negotiation with workers

Previous work in this direction [Singer et al., HCOMP'11] has focused on designing online truthful mechanisms in the bidding model. This requires eliciting the true costs experienced by the workers, which can be challenging for such platforms. In this paper, we focus on the posted-price model, where workers are offered a take-it-or-leave-it price, which is more easily implemented in online crowdsourcing platforms. Figure 1 shows how price negotiation happens in our posted-price model.


Figure 1: Negotiation between requester and worker in the posted-price model compared to the bidding model. b_i denotes the bid submitted by the i-th worker, c_i the true cost experienced by that worker, and p_i the payment offered to the worker.

Our approach

The main challenge in determining the payments is that the distribution of the workers' costs (the "cost curve" F(p), illustrated in Figure 2) is unknown. This leads to the challenge of trading off exploration and exploitation: the mechanism needs to "explore" by experimenting with potentially suboptimal prices, and to "exploit" what it has learned by offering the price that appears best so far. We cast this problem as a multi-armed bandit (MAB) problem under a strict budget constraint B, and use the approach of regret minimization in online learning to design our mechanism.
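To illustrate the explore/exploit tension, here is a generic budget-limited UCB posted-price loop over a discretized price set. This is only a sketch with made-up prices and a stand-in worker model; it is not the BP-UCB mechanism itself, whose exact price updates and guarantees are given in the paper.

```python
# Sketch: explore/exploit over posted prices under a fixed budget.
import math, random

prices = [0.05, 0.10, 0.15, 0.20, 0.25]   # candidate take-it-or-leave-it prices
budget = 20.0
accepts = [0] * len(prices)                # accepted offers per price
offers = [0] * len(prices)                 # offers made per price

def worker_accepts(price):
    # Stand-in for a worker with a private cost drawn from an unknown
    # distribution; she accepts the offer iff price >= cost.
    return price >= random.uniform(0.0, 0.3)

t = 0
while budget >= max(prices):
    t += 1
    def ucb(i):
        # Estimated acceptance rate per dollar plus an exploration bonus.
        if offers[i] == 0:
            return float("inf")
        mean = accepts[i] / offers[i]
        bonus = math.sqrt(2 * math.log(t) / offers[i])
        return (mean + bonus) / prices[i]
    i = max(range(len(prices)), key=ucb)   # pick the most promising price
    offers[i] += 1
    if worker_accepts(prices[i]):
        accepts[i] += 1
        budget -= prices[i]                # only accepted offers are paid

print("offers:", offers, "accepts:", accepts, "budget left:", round(budget, 2))
```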


Figure 2: Upper bound on utility imposed by the budget constraint (red), unknown price curve of the workers (blue), and optimal price (green). B is the fixed budget and N is the fixed number of workers.

Our mechanism BP-UCB

  • We present a novel posted-price mechanism, BP-UCB, for online budgeted procurement, which is guaranteed to be budget feasible, to achieve near-optimal utility for the requester, and to be incentive compatible (truthful) for workers.
  • We prove no-regret bounds for the mechanism; our analysis yields an explicit separation of the regret into budget wasted through overpayment and offers rejected through underpayment.

Experimental Results

We carried out extensive experiments to understand the practical performance of our mechanism on simulated cost distributions (Figure 3 below). Additionally, to demonstrate the effectiveness of our approach on real-world inputs, we carried out a Mechanical Turk study to collect real cost distributions from workers (Figure 4 below).


Figure 3: Simulated cost distributions. Left: compared to the state-of-the-art posted-price mechanism (pp'12) [Badanidiyuru et al., EC'12], our mechanism (bp-ucb) shows up to a 150% increase in utility for a given budget. Right: the average regret of bp-ucb diminishes with increasing budget.


Figure 4: Cost distributions from Mechanical Turk. Left: utility achieved for random arrival of workers, showing a 180% increase in utility compared to the state-of-the-art posted-price mechanism (bp-ucb vs. pp'12). Right: robustness of the mechanism against extreme adversarial inputs, simulated by workers arriving in ascending order of their costs.

For more, see our full paper, Truthful Incentives in Crowdsourcing Tasks using Regret Minimization Mechanisms.
Adish Singla, ETH Zurich
Andreas Krause, ETH Zurich

Let the crowd wrap the web

The web is a valuable source of information, but most of its data cannot be automatically processed because they are intended for human consumption.
Wrappers are specialized programs that extract data from the source code of HTML pages and organize them in a more structured, machine-processable way.

For example, suppose we want to collect data about movies (e.g., titles, directors, actors) by means of a set of wrappers extracting data from sites available on the Web. Beyond the most famous sites (e.g., IMDB), many others must be considered: [Dalvi et al., VLDB 2012] have shown that in many domains, covering 90% of the entities present on the Web requires more than 10,000 sites.

Fully automated approaches for learning wrappers have already been proposed (e.g., RoadRunner [Crescenzi and Merialdo, AAI 2008]), but they exhibit limited accuracy. On the other hand, supervised wrapper generators have limited applicability at web scale. The crowd could be the key to wrapping very large numbers of data-intensive Web sites with high accuracy.

Figure 1: The web application interface of ALFRED. The query and the page are visualized, and the worker can answer with a binary value (Yes/No). To help the worker, the queried value is highlighted.

We propose ALFRED [Crescenzi et al., WWW 2013, DBCrowd 2013], a wrapper inference system supervised by the crowd. To generate wrappers, the system poses sequences of simple questions that require a boolean answer (e.g., "Is 'City of God' the title of the movie in the page?" Y/N). The answers provided by workers recruited on a crowdsourcing platform are exploited to generate the correct wrapper.
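To convey the underlying idea, here is a toy sketch of crowd-supervised rule selection: candidate extraction rules are kept only if a worker confirms the value they extract. The page, rules, and ask_worker helper are hypothetical stand-ins; ALFRED's actual inference algorithm and worker model are described in the papers.

```python
# Toy sketch: keep the candidate XPath rules consistent with a worker's answers.
from lxml import html

page = html.fromstring("""
<html><body>
  <h1 class="title">City of God</h1>
  <span class="director">Fernando Meirelles</span>
</body></html>""")

candidate_rules = ["//h1/text()",
                   "//span[@class='director']/text()",
                   "//h1[@class='title']/text()"]

def ask_worker(value):
    # Stand-in for a boolean crowd question such as:
    # "Is 'City of God' the title of the movie in the page? (Y/N)"
    return value == "City of God"

surviving = []
for rule in candidate_rules:
    extracted = page.xpath(rule)
    if extracted and ask_worker(extracted[0]):
        surviving.append(rule)

print("Rules consistent with worker answers:", surviving)
```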

Preliminary results are promising:

  • To generate accurate wrappers, just a few queries are needed. Even in the presence of inaccurate workers, ALFRED can generate a correct wrapper with fewer than 15 queries.

  • The accuracy of the output wrapper is highly predictable, with an average F-measure close to 100% and a standard deviation of less than 1%, i.e., almost perfect wrappers with small variability.

  • The estimation of workers' error rates is accurate, and spammers and unreliable workers are detected early.

  • Costs are contained and highly predictable thanks to a technique that dynamically engages, at runtime, a minimal number of workers; 92% of the cases are covered by just two workers.

Many challenges are still open:

  • To further reduce costs, we aim to adopt a hybrid approach that partially relies on automatic wrapper generation techniques, with light supervision by the crowd;

  • Gamification is a promising direction for engaging workers and scaling out wrapper generation: people can play games while teaching ALFRED how to wrap the web.

For more, see our project website, ALFRED, and the full paper, A framework for learning web wrappers from the crowd.

Valter Crescenzi, Università Roma Tre
Paolo Merialdo, Università Roma Tre
Disheng Qiu, Università Roma Tre