Multimodal location estimation

  • Authors:
  • Gerald Friedland; Oriol Vinyals; Trevor Darrell

  • Affiliations:
  • International Computer Science Institute, Berkeley, CA, USA; International Computer Science Institute, Berkeley, CA, USA; International Computer Science Institute, Berkeley, CA, USA

  • Venue:
  • Proceedings of the ACM International Conference on Multimedia (MM '10)
  • Year:
  • 2010


Abstract

In this article, we define a multimedia content analysis problem that we call multimodal location estimation: given a video, image, or audio file, the task is to determine where it was recorded. A single cue, such as a unique landmark, might already pinpoint a location precisely. In most cases, however, a combination of evidence from the visual and the acoustic domains will only narrow down the set of possible answers. Approaches to this task should therefore be inherently multimodal. While the task is hard, and sometimes unsolvable, training data can be leveraged from the Internet in large amounts. Moreover, even partially successful automatic location estimation opens up new possibilities in video content matching, archiving, and organization. It could revolutionize law enforcement and computer-aided intelligence work, especially since both semi-automatic and fully automatic approaches are possible. In this article, we describe our vision of growing multimodal location estimation into a research field within the multimedia community. Based on examples and scenarios, we propose a multimodal approach that leverages cues from the visual and acoustic portions of a video as well as from any available metadata. We also describe experiments that estimate the amount of available training data that could serve as publicly available infrastructure for research in this field. Finally, we present an initial set of results based on acoustic and visual cues, and discuss the massive challenges involved along with some possible paths to solutions.
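To make the fusion idea concrete, below is a minimal, purely illustrative sketch of combining evidence across modalities by late fusion over a set of candidate locations. The candidate list, the per-modality posteriors, the weights, and the function fuse_location_posteriors are all invented for illustration; this is a sketch of one plausible fusion scheme, not the authors' actual system.

    import math

    # Hypothetical candidate locations (e.g., city-level cells); illustrative only.
    CANDIDATES = ["San Francisco", "New York", "Tokyo", "Berlin"]

    def fuse_location_posteriors(modality_posteriors, weights):
        """Weighted log-linear fusion of per-modality posteriors
        P(location | cue), renormalized over the candidate set."""
        fused = {}
        for loc in CANDIDATES:
            log_score = sum(
                weights[m] * math.log(max(p.get(loc, 1e-9), 1e-9))  # floor avoids log(0)
                for m, p in modality_posteriors.items()
            )
            fused[loc] = math.exp(log_score)
        total = sum(fused.values())
        return {loc: s / total for loc, s in fused.items()}

    # Toy evidence: visual cues (landmarks), acoustic cues (ambient sound,
    # spoken language), and textual metadata (tags, title). Numbers are made up.
    posteriors = {
        "visual":   {"San Francisco": 0.55, "New York": 0.25, "Tokyo": 0.10, "Berlin": 0.10},
        "acoustic": {"San Francisco": 0.30, "New York": 0.40, "Tokyo": 0.20, "Berlin": 0.10},
        "metadata": {"San Francisco": 0.70, "New York": 0.10, "Tokyo": 0.10, "Berlin": 0.10},
    }
    weights = {"visual": 1.0, "acoustic": 0.5, "metadata": 1.5}

    fused = fuse_location_posteriors(posteriors, weights)
    print(max(fused, key=fused.get))  # most likely location under this toy fusion

The sketch mirrors the abstract's point: no single modality resolves the location, but their combined evidence concentrates the posterior on one candidate.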