Human vs machine: establishing a human baseline for multimodal location estimation

  • Authors and affiliations:
  • Jaeyoung Choi (International Computer Science Institute, Berkeley, CA, USA)
  • Howard Lei (International Computer Science Institute, Berkeley, CA, USA)
  • Venkatesan Ekambaram (University of California at Berkeley, Berkeley, CA, USA)
  • Pascal Kelm (Technische Universität Berlin, Berlin, Germany)
  • Luke Gottlieb (International Computer Science Institute, Berkeley, CA, USA)
  • Thomas Sikora (Technische Universität Berlin, Berlin, Germany)
  • Kannan Ramchandran (University of California at Berkeley, Berkeley, CA, USA)
  • Gerald Friedland (International Computer Science Institute, Berkeley, CA, USA)

  • Venue:
  • Proceedings of the 21st ACM international conference on Multimedia
  • Year:
  • 2013


Abstract

In recent years, the problem of video location estimation (i.e., estimating the longitude/latitude coordinates of a video without GPS information) has been approached with diverse methods and ideas in the research community, and significant improvements have been made. So far, however, systems have only been compared against each other, and no systematic study of human performance has been conducted. Based on a human-subject study comprising 11,900 experiments, this article presents a human baseline for location estimation across different combinations of modalities (audio, audio/video, audio/video/text). Furthermore, this article compares state-of-the-art location estimation systems with the human baseline. Although humans' overall performance at multimodal video location estimation is better than that of current machine learning approaches, the difference is quite small: for 41% of the test set, the machine's accuracy was superior to the humans'. We present case studies and discuss why machines did better on some videos and not on others. Our analysis suggests new directions and priorities for future work on improving location inference algorithms.