Max margin learning on domain-independent web information extraction

  • Authors:
  • Bin Zhao; Xiaoxin Yin; Eric P. Xing

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA, USA; Microsoft Research, Redmond, WA, USA; Carnegie Mellon University, Pittsburgh, PA, USA

  • Venue:
  • Proceedings of the 20th ACM international conference on Information and knowledge management
  • Year:
  • 2011

Abstract

Domain-independent web information extraction can be cast as a structured prediction problem: we learn a function that maps an input web page to a set of structured, interdependent output variables, one label for each block on the page. In this paper, we propose a max margin learning approach to web information extraction, built upon an HTML parser of Internet Explorer that parses and renders a web page based on its HTML tags and visual appearance. Specifically, the output of the parser is a vision tree, which is similar to a DOM tree but also carries visual information, i.e., how each node is displayed. Based on this hierarchical structure, we develop a max margin learning method for labeling each of its nodes. Because the blocks on a web page are richly connected, we further introduce edges that connect spatially adjacent nodes of the vision tree, which turns the problem into a labeling task on a cyclic graph. We develop a max margin learning method on cyclic graphs for this problem, using loopy belief propagation for approximate inference. Experimental results on web data extraction show the feasibility and promise of our approach.
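
To make the inference step mentioned in the abstract more concrete, the following is a minimal sketch of max-sum loopy belief propagation on a small cyclic graph. It is not the authors' implementation: the function name `loopy_max_sum`, the node names, the two labels, the numeric scores, and the symmetric agreement bonus on edges are all assumptions chosen for illustration. In max margin training, a routine of this kind would typically be called inside the learning loop to find a high-scoring labeling under the current weights; the learning step itself is not shown here.

```python
from collections import defaultdict


def loopy_max_sum(node_scores, edges, edge_score, n_iters=20):
    """Approximate best labeling on a possibly cyclic graph (max-sum loopy BP).

    node_scores: dict mapping node -> list of per-label scores (hypothetical values)
    edges:       list of undirected (u, v) pairs
    edge_score:  symmetric function (label_u, label_v) -> float, shared by all edges
    """
    n_labels = len(next(iter(node_scores.values())))
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # messages[(u, v)][l] is the message from u to v about v taking label l.
    messages = {}
    for u, v in edges:
        messages[(u, v)] = [0.0] * n_labels
        messages[(v, u)] = [0.0] * n_labels

    for _ in range(n_iters):
        new_messages = {}
        for (u, v) in messages:
            # Combine u's own score with messages from every neighbor except v.
            incoming = [
                node_scores[u][lu]
                + sum(messages[(w, u)][lu] for w in neighbors[u] if w != v)
                for lu in range(n_labels)
            ]
            msg = [
                max(incoming[lu] + edge_score(lu, lv) for lu in range(n_labels))
                for lv in range(n_labels)
            ]
            # Shift so the largest entry is 0; keeps values bounded on cycles.
            top = max(msg)
            new_messages[(u, v)] = [m - top for m in msg]
        messages = new_messages

    # Each node picks the label with the highest belief.
    labeling = {}
    for u in node_scores:
        belief = [
            node_scores[u][l] + sum(messages[(w, u)][l] for w in neighbors[u])
            for l in range(n_labels)
        ]
        labeling[u] = max(range(n_labels), key=belief.__getitem__)
    return labeling


# Toy graph (assumed): one parent block with two children (tree edges), plus a
# spatial-adjacency edge between the children, which closes a cycle.
# Labels (assumed): 0 = "other", 1 = "product".
scores = {
    "product_block": [0.0, 3.0],
    "title": [0.0, 3.0],
    "price": [0.2, 0.0],  # weak local evidence, slightly favoring "other"
}
edges = [
    ("product_block", "title"),
    ("product_block", "price"),
    ("title", "price"),  # spatially adjacent siblings
]


def agree(a, b):
    """Reward neighboring blocks that share a label (assumed potential)."""
    return 1.0 if a == b else 0.0


print(loopy_max_sum(scores, edges, agree))
```

In this toy case the `price` node would be labeled "other" if decoded on its own, but the messages from its parent and its spatial neighbor pull it to "product". Loopy belief propagation is only approximate on cyclic graphs, so agreement with the exact optimum is not guaranteed in general.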