A comparison of discriminative classifiers for web news content extraction

Authors:
Alex Spengler;Antoine Bordes;Patrick Gallinari
Affiliations:
Université Paris, Paris, France;Université Paris, Paris, France;Université Paris, Paris, France
Venue:
RIAO '10 Adaptivity, Personalization and Fusion of Heterogeneous Information
Year:
2010

Abstract

Until now, approaches to web content extraction have focused on random field models, largely neglecting large margin methods. Structured large margin methods, however, have recently shown great practical success. We compare, for the first time, greedy and structured support vector machines with conditional random fields on a real-world web news content extraction task, showing that large margin approaches are indeed competitive with random field models.