Proceedings of the 2nd ACM international workshop on Geotagging and its applications in multimedia
In this article, we define a multimedia content analysis problem that we call multimodal location estimation: given a video, image, or audio file, the task is to determine where it was recorded. A single indication, such as a unique landmark, might already pinpoint a location precisely. In most cases, however, a combination of evidence from the visual and the acoustic domains will only narrow down the set of possible answers. Approaches to this task should therefore be inherently multimedia. While the task is hard, and in fact sometimes unsolvable, large amounts of training data can be obtained from the Internet. Moreover, even partially successful automatic location estimation opens up new possibilities in video content matching, archiving, and organization. It could revolutionize law enforcement and computer-aided intelligence agency work, especially since both semi-automatic and fully automatic approaches would be possible. In this article, we describe our idea of growing multimodal location estimation as a research field in the multimedia community. Based on examples and scenarios, we propose a multimedia approach that leverages cues from the visual and the acoustic portions of a video as well as from the given metadata. We also describe experiments to estimate the amount of available training data that could potentially serve as publicly available infrastructure for research in this field. Finally, we present an initial set of results based on acoustic and visual cues and discuss the massive challenges involved and some possible paths to solutions.