A hierarchical approach to wrapper induction
Proceedings of the third annual conference on Autonomous Agents
A brief survey of web data extraction tools
ACM SIGMOD Record
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Table extraction using conditional random fields
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Automating Content Extraction of HTML Documents
World Wide Web
Shallow parsing with conditional random fields
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Columbia Newsblaster: multilingual news summarization on the web
HLT-NAACL--Demonstrations '04 Demonstration Papers at HLT-NAACL 2004
Semantic role labeling as sequential tagging
CONLL '05 Proceedings of the Ninth Conference on Computational Natural Language Learning
Coreex: content extraction from online news articles
Proceedings of the 17th ACM conference on Information and knowledge management
Boilerplate detection using shallow text features
Proceedings of the third ACM international conference on Web search and data mining
Document structure meets page layout: loopy random fields for web news content extraction
Proceedings of the 10th ACM symposium on Document engineering
A comparison of discriminative classifiers for web news content extraction
RIAO '10 Adaptivity, Personalization and Fusion of Heterogeneous Information
Automatic Extraction of Blog Post from Diverse Blog Pages
WI-IAT '12 Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
A hybrid approach for extracting informative content from web pages
Information Processing and Management: an International Journal
Hi-index | 0.00 |
Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of web-pages anywhere from 80-97% of the time depending on experimental factors such as ensuring the absence of duplicate documents and application of the model against unseen sources.