We present an extensive three-year study on economically annotating video with crowdsourced marketplaces. Our public framework has annotated thousands of real-world videos, including massive data sets unprecedented in their size, complexity, and cost. To accomplish this, we designed a state-of-the-art video annotation user interface and demonstrate that, despite common intuition, many contemporary interfaces are suboptimal. We present several user studies that evaluate different aspects of our system and demonstrate that minimizing the user's cognitive load is crucial when designing an annotation platform. We then deploy this interface on Amazon Mechanical Turk and discover expert and talented workers who are capable of annotating difficult videos with dense and closely cropped labels. We argue that video annotation requires specialized skill; most workers are poor annotators, mandating robust quality-control protocols. We show that traditional crowdsourced micro-tasks are not suitable for video annotation and instead demonstrate that deploying time-consuming macro-tasks on MTurk is effective. Finally, we show that by extracting pixel-based features from manually labeled key frames, we are able to leverage more sophisticated interpolation strategies that maximize performance given a fixed budget. We validate the power of our framework on difficult, real-world data sets and demonstrate an inherent trade-off between the mix of human and cloud computing used and the accuracy and cost of the labeling. We further introduce a novel, cost-based evaluation criterion that compares vision algorithms by the budget required to achieve acceptable performance. We hope our findings will spur innovation in the creation of massive labeled video data sets and enable novel data-driven computer vision applications.
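To make the keyframe interpolation step concrete, below is a minimal Python sketch of the simplest such strategy: linear interpolation of bounding boxes between two manually labeled key frames. The names (Box, interpolate_boxes) are illustrative, not from the paper; the feature-based strategies described in the abstract refine these estimates with pixel evidence rather than pure geometry.

    from dataclasses import dataclass

    @dataclass
    class Box:
        # Axis-aligned bounding box: top-left corner plus width and height.
        x: float
        y: float
        w: float
        h: float

    def interpolate_boxes(frame_a, box_a, frame_b, box_b):
        # Yield (frame, Box) for every frame strictly between two labeled
        # key frames, blending each coordinate linearly with temporal offset.
        span = frame_b - frame_a
        for t in range(frame_a + 1, frame_b):
            alpha = (t - frame_a) / span
            yield t, Box(
                x=(1 - alpha) * box_a.x + alpha * box_b.x,
                y=(1 - alpha) * box_a.y + alpha * box_b.y,
                w=(1 - alpha) * box_a.w + alpha * box_b.w,
                h=(1 - alpha) * box_a.h + alpha * box_b.h,
            )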
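The cost-based evaluation criterion can likewise be sketched: given an algorithm's measured accuracy at several annotation budgets, report the smallest budget at which it reaches a target accuracy. The curve data and threshold below are hypothetical, chosen only to show the comparison.

    def budget_to_reach(curve, target_accuracy):
        # curve: list of (budget_dollars, accuracy) pairs, sorted by budget.
        # Returns the smallest budget achieving the target, or None if the
        # algorithm never reaches it within the measured range.
        for budget, accuracy in curve:
            if accuracy >= target_accuracy:
                return budget
        return None

    # Hypothetical cost-accuracy curve for one labeling pipeline.
    curve = [(10, 0.52), (50, 0.71), (200, 0.83), (1000, 0.91)]
    print(budget_to_reach(curve, 0.80))  # -> 200

Under this criterion, the better algorithm is the one that reaches acceptable performance at the lower budget, rather than the one with the higher accuracy at unbounded cost.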