Semantic structure from motion: a novel framework for joint object recognition and 3D reconstruction

Authors:
Sid Yingze Bao;Silvio Savarese
Affiliations:
The University of Michigan, Ann Arbor, MI;The University of Michigan, Ann Arbor, MI
Venue:
Proceedings of the 15th international conference on Theoretical Foundations of Computer Vision: outdoor and large-scale real-world scene analysis
Year:
2011

Abstract

Conventional rigid structure from motion (SFM) addresses the problem of recovering the camera parameters (motion) and the 3D locations of scene points (structure), given observed 2D image feature points. In this chapter, we propose a new formulation called Semantic Structure From Motion (SSFM). In addition to the geometrical constraints provided by SFM, SSFM takes advantage of both semantic and geometrical properties associated with objects in a scene. These properties make it possible to jointly estimate the structure of the scene, the camera parameters, and the 3D locations, poses, and categories of the objects in the scene. We cast this as a maximum-likelihood problem in which geometry (cameras, points, objects) and semantic information (object classes) are estimated simultaneously. The key intuition is that, in addition to image features, the measurements of objects across views provide additional geometrical constraints that relate cameras and scene parameters. These constraints make the geometry estimation process more robust and, in turn, make object detection more accurate. Our framework has the unique ability to: i) estimate camera poses from object detections alone; ii) improve camera pose estimation compared with feature-point-based SFM algorithms; and iii) improve object detection given multiple uncalibrated images, compared with detecting objects independently in single images. Extensive quantitative results on three datasets (LiDAR cars, street-view pedestrians, and Kinect office desktop) verify our theoretical claims.
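
The maximum-likelihood problem described above can be written out roughly as follows. This is only a sketch: the symbols C (cameras), Q (3D points), O (objects, each with a 3D location, pose, and category), q (2D feature-point measurements), and o (object detections) are notation assumed here for illustration. The factorization in the second line further assumes that point and object measurements are conditionally independent given the cameras and the scene. Neither the notation nor that assumption is stated in the abstract.

```latex
% Illustrative notation (assumed, not taken from the abstract):
%   C = cameras, Q = 3D points, O = objects (location, pose, category)
%   q = 2D feature-point measurements, o = object detections
\{\hat{C}, \hat{Q}, \hat{O}\}
  = \arg\max_{C,\,Q,\,O} \Pr(q, o \mid C, Q, O)
  = \arg\max_{C,\,Q,\,O} \Pr(q \mid C, Q)\,\Pr(o \mid C, O)
```

Under this reading, the first factor plays the role of the usual feature-point term in SFM, while the second ties the cameras to the objects, which is where the additional geometrical constraints would enter.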