Snowball: extracting relations from large plain-text collections
DL '00 Proceedings of the fifth ACM conference on Digital libraries
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning
On the algorithmic implementation of multiclass kernel-based vector machines
The Journal of Machine Learning Research
Tree consistency and bounds on the performance of the max-product algorithm and its generalizations
Statistics and Computing
Large Margin Methods for Structured and Interdependent Output Variables
The Journal of Machine Learning Research
Wikify!: linking documents to encyclopedic knowledge
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Graphical Models, Exponential Families, and Variational Inference
Graphical Models, Exponential Families, and Variational Inference
FACTO: a fact lookup engine based on web tables
Proceedings of the 20th international conference on World wide web
Loopy belief propagation for approximate inference: an empirical study
UAI'99 Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence
On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs
IEEE Transactions on Information Theory
Tree-based reparameterization framework for analysis of sum-product and related algorithms
IEEE Transactions on Information Theory
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model
ACM Transactions on the Web (TWEB)
Hi-index | 0.00 |
Domain-independent web information extraction can be addressed as a structured prediction problem where we learn a mapping function from an input web page to the structured and interdependent output variables, labeling each block on the page. In this paper, built upon an HTML parser of Internet Explorer that parses and renders a web page based on HTML tags and visual appearance, we propose a max margin learning approach for web information extraction. Specifically, the output of the parser is a vision tree, which is similar to a DOM tree but with visual information, i.e., how each node is displayed. Based on this hierarchical structure, we develop a max margin learning method for labeling each of its nodes. Due to the rich connections between blocks on the web page, we further introduce edges that connect spatially adjacent nodes on the vision tree, complicating the problem into a cyclic graph labeling task. A max margin learning method on cyclic graphs is developed for this problem, where loopy belief propagation is used for approximate inference. Experimental results on web data extraction show the feasibility and promise of our approach.