Wednesday, January 12, 2011

Panos Ipeirotis: Get Another Label? Improving Data Quality and Machine Learning using Multiple, Noisy Labelers

Title: Get Another Label? Improving Data Quality and Machine Learning
using Multiple, Noisy Labelers

Time: Thursday, January 13th, from 12:00pm to 1:00pm.
Room: GHC6115
Speaker: Panos Ipeirotis

Abstract: I will discuss the repeated acquisition of "labels" for data
items when the labeling is imperfect. Labels are values provided by humans
for specified variables on data items, such as "PG-13" for "Adult Content
Rating on this Web Page." With the increasing popularity of
micro-outsourcing systems, such as Amazon's Mechanical Turk, it is often
possible to obtain less-than-expert labeling at low cost. We examine the
improvement (or lack thereof) in data quality via repeated labeling, and
focus especially on the improvement of training labels for supervised
induction. We present repeated-labeling strategies of increasing
complexity, and show several main results: (i) Repeated labeling can
improve label quality and model quality (per unit data-acquisition cost),
but not always. (ii) Simple strategies can give considerable advantage,
and carefully selecting the points to relabel does even
better (we present and evaluate several techniques). (iii) Labeler
(worker) quality can be estimated on the fly (e.g., to determine
compensation, control quality or eliminate Mechanical Turk spammers) and
systematic biases can be corrected. I illustrate the results with a
real-life application from on-line advertising: using Mechanical Turk to
help classify web pages as being objectionable to advertisers. Time
permitting, I will also discuss our latest results showing that mice and
Mechanical Turk workers are not that different after all.
This is joint work with Foster Provost, Victor S. Sheng, and Jing Wang. An
earlier version of the work received the Best Paper Award Runner-up at the
ACM SIGKDD Conference.
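
As a rough illustration of result (i), the sketch below computes how often a
simple majority vote over several noisy labels is correct. It is not taken
from the talk: it assumes binary labels, labelers who err independently, and
a single shared accuracy p for every labeler. The function name
majority_vote_quality is made up for this example.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that a majority of n labels is correct.

    Assumptions (illustrative only): binary labels, independent labelers,
    and every labeler is correct with the same probability p.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("use a positive odd number of labels to avoid ties")
    # Sum the binomial probabilities of having more than n/2 correct labels.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    for p in (0.4, 0.5, 0.7):
        qualities = [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 11)]
        print(f"p={p}: {qualities}")
```

Under these assumptions, extra labels help only when p is above 0.5. At
p = 0.5 the majority vote stays at chance, and below 0.5 more labels make the
aggregated label worse, which is one way to read "but not always."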