Let the crowd wrap the web

The web is a valuable source of information, but most of its data cannot be automatically processed, since it is intended for human consumption.
Wrappers are specialized programs that extract data from the source code of HTML pages and organize it in a more structured way, making it machine processable.
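
As a concrete, deliberately tiny illustration, a wrapper can be thought of as a set of extraction rules (e.g., XPath expressions) applied to a page. The page layout and the rules below are hypothetical, just a sketch of the idea:

    # A minimal sketch: a wrapper as a set of XPath extraction rules.
    # The page layout and the rules are made up for illustration.
    from lxml import html

    MOVIE_RULES = {
        "title":    "//h1[@class='title']/text()",
        "director": "//span[@class='director']/text()",
        "actors":   "//ul[@class='cast']/li/text()",
    }

    def apply_wrapper(page_source, rules):
        """Extract a structured record from raw HTML using XPath rules."""
        tree = html.fromstring(page_source)
        return {attr: tree.xpath(xpath) for attr, xpath in rules.items()}

    page = """<html><body>
      <h1 class="title">City of God</h1>
      <span class="director">Fernando Meirelles</span>
    </body></html>"""
    print(apply_wrapper(page, MOVIE_RULES))
    # {'title': ['City of God'], 'director': ['Fernando Meirelles'], 'actors': []}

The hard part, of course, is not applying the rules but finding the right ones for each of the thousands of differently structured sites.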

For example, suppose we want to collect data about movies (e.g., titles, directors, actors) by means of a set of wrappers that extract the data from sites available on the Web. Beyond the most famous web sites (e.g., IMDb), many others have to be considered: [Dalvi et al., VLDB 2012] have shown that in many domains, covering 90% of the entities present on the Web requires more than 10,000 sites.

Fully automated approaches for learning wrappers have already been proposed (e.g., RoadRunner [Crescenzi and Merialdo, AAI 2008]), but they exhibit limited accuracy. On the other hand, supervised wrapper generators have limited applicability at web scale. The crowd could be the key to wrapping very large numbers of data-intensive web sites with high accuracy.

Figure 1: The web interface of ALFRED. The query and the page are visualized, and the worker can answer with a binary value (Yes/No). To help the worker, the queried value is highlighted.

We propose ALFRED [Crescenzi et al., WWW 2013, DBCrowd 2013], a wrapper inference system supervised by the crowd. To generate wrappers, the system poses sequences of simple questions that require a Boolean answer (e.g., “Is ‘City of God’ the title of the movie in the page?” Y/N). The answers provided by workers recruited on a crowdsourcing platform are exploited to infer the correct wrapper.
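
To make the idea concrete, here is a simplified sketch of such a crowd-supervised learning loop; it is not ALFRED's actual algorithm, and the names (infer_rule, ask_worker) are our own. Candidate extraction rules are modeled as plain functions from a page to a value, and each Yes/No answer prunes the candidates that contradict it:

    # Simplified sketch of learning by Boolean membership queries (not the
    # actual ALFRED algorithm). Each candidate rule is a function page -> value;
    # ask_worker(page, value) returns the worker's Yes/No answer as a bool.
    from collections import Counter

    def infer_rule(candidates, pages, ask_worker):
        alive = list(candidates)
        for page in pages:
            if len(alive) <= 1:
                break  # a single rule survived: the wrapper is learned
            extracted = [rule(page) for rule in alive]
            # query the value proposed by most surviving rules: a "No"
            # discards all of them, a "Yes" discards all the others
            value, _ = Counter(extracted).most_common(1)[0]
            answer = ask_worker(page, value)  # "Is <value> the title ...?"
            alive = [r for r, v in zip(alive, extracted) if (v == value) == answer]
        return alive[0] if alive else None

Since real workers can be inaccurate (see the results below), the actual system cannot simply prune candidates deterministically as this sketch does.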

Preliminary results are promising:

  • To generate accurate wrappers, just a few queries are needed: even in the presence of inaccurate workers, ALFRED can generate a correct wrapper with fewer than 15 queries.

  • The accuracy of the output wrapper is highly predictable, with an average F-measure close to 100% and a standard deviation below 1%, i.e., almost perfect wrappers with small variability.

  • The estimation of the workers’ error rates is accurate, and spammers and unreliable workers are detected early.

  • Costs are contained and highly predictable, thanks to a technique that dynamically engages, at runtime, a minimal number of workers (see the sketch below); in 92% of the cases, just two workers suffice.
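
The dynamic engagement policy can be sketched as follows; the stopping rule (a plain agreement threshold) and the parameter values are our illustrative assumptions, standing in for ALFRED's actual estimates of answer quality and worker error rates:

    # Illustrative sketch of dynamic worker engagement: start with minimal
    # redundancy and pay for an extra worker only while the collected Yes/No
    # answers do not agree strongly enough. Thresholds are made up.
    def collect_answers(question, recruit_worker,
                        min_workers=2, max_workers=5, min_agreement=0.7):
        answers = [recruit_worker(question) for _ in range(min_workers)]
        while len(answers) < max_workers:
            yes = sum(answers)  # answers are booleans
            agreement = max(yes, len(answers) - yes) / len(answers)
            if agreement >= min_agreement:
                break  # answers are consistent enough: stop spending money
            answers.append(recruit_worker(question))  # engage one more worker
        return sum(answers) > len(answers) / 2  # majority vote

Under such a policy, two agreeing workers settle a question immediately, which is consistent with the observation that two workers suffice in most cases.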

Many challenges are still open:

  • to further reduce costs, we aim to adopt a hybrid approach that partially relies on automatic wrapper generation techniques, with light supervision by the crowd

  • gamification is a promising direction to engage workers and scale out wrapper generation: people can play games while teaching ALFRED how to wrap the web.

For more, see our project website, ALFRED, and the full paper, A framework for learning web wrappers from the crowd.

Valter Crescenzi, Università Roma Tre
Paolo Merialdo, Università Roma Tre
Disheng Qiu, Università Roma Tre 

2 thoughts on “Let the crowd wrap the web”

  1. Very interesting approach! Excited to read the paper. I would like to ask: what factors affect the average learning and training times, and how do you deal with decreasing them? I could imagine that requesters of the wrapper expect very short response times, so the average learning and training times seem to be important factors for a fast runtime response.

    • Hi Emomeni,
      The main factor that affects the learning time is the money we are willing to spend to generate the wrappers. In our current optimization we dynamically engage workers from the crowdsourcing platform, starting with a minimal amount of redundancy (say 2). If the answers are not “good enough”, the system engages a new worker, increasing the latency but saving money. A way to reduce the latency is to engage a larger number of workers from the beginning, but this costs more.
      In our experiments the working time to generate a wrapper is just a few minutes.
      Dealing with a crowdsourcing platform, we also have to consider the time for a worker to accept the task. It can range from a few minutes to hours (it depends on many factors!).
      If you want to read the DBCrowd paper, just contact me.
