Road scene segmentation from a single image

  • Authors:
  • Jose M. Alvarez; Theo Gevers; Yann LeCun; Antonio M. Lopez

  • Affiliations:
  • Jose M. Alvarez: Courant Institute of Mathematical Sciences, New York University, New York, NY, USA; Computer Vision Center, Univ. Autònoma de Barcelona, Barcelona, Spain
  • Theo Gevers: Faculty of Science, University of Amsterdam, Amsterdam, The Netherlands; Computer Vision Center, Univ. Autònoma de Barcelona, Barcelona, Spain
  • Yann LeCun: Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
  • Antonio M. Lopez: Computer Vision Center, Univ. Autònoma de Barcelona, Barcelona, Spain

  • Venue:
  • ECCV'12: Proceedings of the 12th European Conference on Computer Vision, Part VII
  • Year:
  • 2012

Abstract

Road scene segmentation is important in computer vision for applications such as autonomous driving and pedestrian detection. Recovering the 3D structure of road scenes provides relevant contextual information that improves their understanding. In this paper, we use a convolutional neural network based algorithm to learn features from noisy labels and recover the 3D scene layout of a road image. The novelty of the algorithm lies in generating training labels by applying an algorithm trained on a general image dataset to classify on-board images. Further, we propose a novel texture descriptor based on a learned fusion of color planes that yields maximal uniformity in road areas. Finally, acquired (off-line) and current (on-line) information are combined to detect road areas in single images. Quantitative and qualitative experiments, conducted on publicly available datasets, show that convolutional neural networks are suitable for learning the 3D scene layout from noisy labels, providing a relative improvement of 7% over the baseline. Furthermore, combining color planes yields a statistical description of road areas that exhibits maximal uniformity and provides a relative improvement of 8% over the baseline. Finally, the improvement is even larger when the acquired and current information from a single image are combined.
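To make the color-plane fusion and evidence-combination steps concrete, below is a minimal sketch of one plausible reading of the abstract, not the authors' implementation: choose unit-norm weights for a linear combination of color planes that minimize the variance (i.e., maximize uniformity) of the fused plane over pixels assumed to be road, then merge the resulting on-line road likelihood with an off-line prior map. The function names (learn_fusion_weights, fuse, combine_evidence) and the eigenvector-based variance minimization are illustrative assumptions.

```python
import numpy as np

def learn_fusion_weights(planes, road_mask):
    """Learn unit-norm weights for a linear combination of color planes
    that minimizes the variance of the fused plane over road pixels.

    planes: (H, W, K) stack of K color planes (e.g., R, G, B, H, S, V).
    road_mask: (H, W) boolean mask of pixels assumed to be road
               (e.g., from noisy training labels).
    Returns: (K,) weight vector.
    """
    X = planes[road_mask].astype(np.float64)   # (N, K) road-pixel samples
    X -= X.mean(axis=0)                        # center each plane
    C = np.cov(X, rowvar=False)                # (K, K) plane covariance
    vals, vecs = np.linalg.eigh(C)             # eigenvalues in ascending order
    return vecs[:, 0]                          # direction of least variance

def fuse(planes, w):
    """Project the plane stack onto the learned weights: (H, W) fused plane."""
    return planes @ w

def combine_evidence(offline_prior, online_likelihood):
    """Pixelwise combination of the acquired (off-line) road prior and the
    current (on-line) uniformity-based likelihood; a simple product rule is
    assumed here."""
    p = offline_prior * online_likelihood
    return p / (p.max() + 1e-12)               # normalize to [0, 1]
```

Under this reading, the minimum-variance direction is simply the eigenvector of the road-pixel covariance matrix with the smallest eigenvalue, so "maximal uniformity" reduces to a closed-form eigendecomposition rather than an iterative search.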