Max margin learning on domain-independent web information extraction

Authors:
Bin Zhao;Xiaoxin Yin;Eric P. Xing
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA;Microsoft Research, Redmond, WA, USA;Carnegie Mellon University, Pittsburgh, PA, USA
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

Domain-independent web information extraction can be addressed as a structured prediction problem, in which we learn a mapping from an input web page to structured and interdependent output variables that label each block on the page. In this paper, we propose a max margin learning approach for web information extraction, built upon an HTML parser of Internet Explorer that parses and renders a web page based on its HTML tags and visual appearance. Specifically, the output of the parser is a vision tree, which is similar to a DOM tree but also carries visual information, i.e., how each node is displayed. Based on this hierarchical structure, we develop a max margin learning method for labeling each of its nodes. Because blocks on a web page are richly connected, we further introduce edges between spatially adjacent nodes of the vision tree, which turns the problem into a cyclic graph labeling task. We develop a max margin learning method on cyclic graphs for this task, using loopy belief propagation for approximate inference. Experimental results on web data extraction show the feasibility and promise of our approach.
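To make the inference step concrete, the following is a minimal sketch of loopy belief propagation (in its max-sum form) on a small cyclic graph, of the kind that arises once a spatial-adjacency edge is added to a tree. It is not the authors' implementation: the function name `loopy_max_sum`, the single symmetric pairwise score table shared by all edges, and the hand-picked scores are all assumptions made for illustration. In the paper's setting the scores would come from the learned max margin model.

```python
from typing import List, Tuple


def loopy_max_sum(
    unary: List[List[float]],
    edges: List[Tuple[int, int]],
    pairwise: List[List[float]],
    n_iters: int = 20,
) -> List[int]:
    """Approximate MAP labeling with max-sum loopy belief propagation.

    unary[i][x]    : score of giving node i label x (hypothetical values)
    edges          : undirected edges; the graph may contain cycles
    pairwise[x][y] : score of adjacent labels x and y; assumed symmetric
                     and shared by every edge (a simplification)
    """
    n, k = len(unary), len(unary[0])
    neighbors = {i: [] for i in range(n)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # One message per directed edge, initialized to zero (log domain).
    msgs = {(i, j): [0.0] * k for i in neighbors for j in neighbors[i]}

    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:
            # Evidence at node i from everything except node j.
            base = [
                unary[i][xi]
                + sum(msgs[(h, i)][xi] for h in neighbors[i] if h != j)
                for xi in range(k)
            ]
            # Maximize over the sender's label for each receiver label.
            out = [
                max(base[xi] + pairwise[xi][xj] for xi in range(k))
                for xj in range(k)
            ]
            top = max(out)  # normalize so messages stay bounded
            new_msgs[(i, j)] = [v - top for v in out]
        msgs = new_msgs

    labels = []
    for i in range(n):
        belief = [
            unary[i][x] + sum(msgs[(h, i)][x] for h in neighbors[i])
            for x in range(k)
        ]
        labels.append(max(range(k), key=belief.__getitem__))
    return labels


# Toy graph: node 0 is the parent of nodes 1, 2, 3 (tree edges), and the
# extra edge (2, 3) plays the role of a spatial-adjacency edge, creating
# the cycle 0-2-3.
unary = [[2.0, 0.0], [0.0, 0.3], [2.0, 0.0], [0.0, 0.5]]
edges = [(0, 1), (0, 2), (0, 3), (2, 3)]
agree = [[1.0, 0.0], [0.0, 1.0]]  # reward neighbors that share a label

labels = loopy_max_sum(unary, edges, agree)
print(labels)  # [0, 0, 0, 0]
```

With the agreement reward switched off (an all-zero pairwise table), each node simply takes its own best-scoring label, `[0, 1, 0, 1]`. With the reward on, the weakly scored nodes 1 and 3 are pulled toward the label of their confident neighbors, which is the effect the additional edges are meant to capture. On graphs with cycles this procedure is approximate and is not guaranteed to converge in general.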