We present an extensive three-year study on economically annotating video with crowdsourced marketplaces. Our public framework has annotated thousands of real-world videos, including massive data sets unprecedented in their size, complexity, and cost. To accomplish this, we designed a state-of-the-art video annotation user interface and demonstrate that, despite common intuition, many contemporary interfaces are suboptimal. We present several user studies that evaluate different aspects of our system and demonstrate that minimizing the user's cognitive load is crucial when designing an annotation platform. We then deploy this interface on Amazon Mechanical Turk and discover expert and talented workers who are capable of annotating difficult videos with dense and closely cropped labels. We argue that video annotation requires specialized skill; most workers are poor annotators, mandating robust quality-control protocols. We show that traditional crowdsourced micro-tasks are not suitable for video annotation and instead demonstrate that deploying time-consuming macro-tasks on MTurk is effective. Finally, we show that by extracting pixel-based features from manually labeled key frames, we are able to leverage more sophisticated interpolation strategies that maximize performance given a fixed budget. We validate the power of our framework on difficult, real-world data sets and demonstrate an inherent trade-off between the mix of human and cloud computing used and the accuracy and cost of the labeling. We further introduce a novel, cost-based evaluation criterion that compares vision algorithms by the budget required to achieve acceptable performance. We hope our findings will spur innovation in the creation of massive labeled video data sets and enable novel data-driven computer vision applications.
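To make the keyframe interpolation step concrete, below is a minimal Python sketch of the simplest such strategy: linear interpolation of bounding boxes between two manually labeled key frames. The names (Box, interpolate_boxes) are illustrative, not from the paper; the feature-based strategies described in the abstract refine these estimates with pixel evidence rather than pure geometry.

    from dataclasses import dataclass

    @dataclass
    class Box:
        # Axis-aligned bounding box: top-left corner plus width and height.
        x: float
        y: float
        w: float
        h: float

    def interpolate_boxes(frame_a, box_a, frame_b, box_b):
        # Yield (frame, Box) for every frame strictly between two labeled
        # key frames, blending each coordinate linearly with temporal offset.
        span = frame_b - frame_a
        for t in range(frame_a + 1, frame_b):
            alpha = (t - frame_a) / span
            yield t, Box(
                x=(1 - alpha) * box_a.x + alpha * box_b.x,
                y=(1 - alpha) * box_a.y + alpha * box_b.y,
                w=(1 - alpha) * box_a.w + alpha * box_b.w,
                h=(1 - alpha) * box_a.h + alpha * box_b.h,
            )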
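The cost-based evaluation criterion can likewise be sketched: given an algorithm's measured accuracy at several annotation budgets, report the smallest budget at which it reaches a target accuracy. The curve data and threshold below are hypothetical, chosen only to show the comparison.

    def budget_to_reach(curve, target_accuracy):
        # curve: list of (budget_dollars, accuracy) pairs, sorted by budget.
        # Returns the smallest budget achieving the target, or None if the
        # algorithm never reaches it within the measured range.
        for budget, accuracy in curve:
            if accuracy >= target_accuracy:
                return budget
        return None

    # Hypothetical cost-accuracy curve for one labeling pipeline.
    curve = [(10, 0.52), (50, 0.71), (200, 0.83), (1000, 0.91)]
    print(budget_to_reach(curve, 0.80))  # -> 200

Under this criterion, the better algorithm is the one that reaches acceptable performance at the lower budget, rather than the one with the higher accuracy at unbounded cost.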