Semantic structure from motion

Authors:
S. Y. Bao; S. Savarese
Affiliations:
Dept. of Electr. & Comput. Eng., Univ. of Michigan at Ann Arbor, Ann Arbor, MI, USA; Dept. of Electr. & Comput. Eng., Univ. of Michigan at Ann Arbor, Ann Arbor, MI, USA
Venue:
CVPR '11 Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition
Year:
2011

Abstract

Conventional rigid structure from motion (SFM) addresses the problem of recovering the camera parameters (motion) and the 3D locations (structure) of scene points, given observed 2D image feature points. In this paper, we propose a new formulation called Semantic Structure From Motion (SSFM). In addition to the geometrical constraints provided by SFM, SSFM takes advantage of both semantic and geometrical properties associated with objects in the scene (Fig. 1). These properties allow us to recover not only the structure and motion but also the 3D locations, poses, and categories of objects in the scene. We cast this as a maximum-likelihood problem in which geometry (cameras, points, objects) and semantic information (object classes) are estimated simultaneously. The key intuition is that, in addition to image features, the measurements of objects across views provide geometrical constraints that relate cameras and scene parameters. These constraints make the geometry estimation process more robust and, in turn, make object detection more accurate. Our framework has the unique ability to: (i) estimate camera poses from object detections alone; (ii) improve camera pose estimation compared to feature-point-based SFM algorithms; and (iii) improve object detection given multiple uncalibrated images, compared to detecting objects independently in single images. Extensive quantitative results on three datasets (LiDAR cars, street-view pedestrians, and Kinect office desktop) verify our theoretical claims.
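
The maximum-likelihood formulation described above can be sketched as a joint objective. The notation below is assumed for illustration and is not taken from the abstract: C denotes the camera parameters, Q the 3D scene points, O the 3D objects (location, pose, and category), q the observed 2D feature points, and o the 2D object detections. The second line additionally assumes that point and object measurements are conditionally independent given the cameras and the scene.

```latex
% Hedged sketch of a joint maximum-likelihood objective (assumed notation).
% C: cameras, Q: 3D points, O: 3D objects; q: 2D points, o: 2D detections.
\begin{aligned}
\{\hat{C}, \hat{Q}, \hat{O}\}
  &= \arg\max_{C,\,Q,\,O} \; \Pr(q, o \mid C, Q, O) \\
  % Assumption: q and o are conditionally independent given C, Q, O.
  &= \arg\max_{C,\,Q,\,O} \; \Pr(q \mid C, Q)\,\Pr(o \mid C, O)
\end{aligned}
```

Under this sketch, the first factor plays the role of the usual feature-point SFM term, while the second factor ties the cameras to object detections. This is one way to read the claim that object measurements across views add constraints on the camera parameters.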