Web content extraction is concerned with the automatic identification of semantically interesting web page regions. To generalize to pages from unknown sites, it is crucial to exploit not only the local characteristics of a particular web page region, but also the rich interdependencies that exist between the regions and their latent semantics. We therefore propose a loopy conditional random field that combines semantic intra-page dependencies derived from both document structure and page layout, uses a realistic set of local and relational features, and is efficiently learnt in the tree-based reparameterization framework. Our empirical analysis uses a corpus of real-world news web pages from 177 distinct sites, with multiple annotations at the DOM node level. The results demonstrate that combining document structure and layout-driven interdependencies leads to a significant error reduction on the semantically interesting regions of a web page.
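To make the idea of combining structure-driven and layout-driven dependencies concrete, the following is a minimal sketch, not the model described above. It runs plain sum-product loopy belief propagation on a tiny pairwise model with binary labels (0 = boilerplate, 1 = content), which is a simpler stand-in for the tree-based reparameterization framework named in the abstract. The region names, local scores, edge lists, and the single `coupling` weight are all hypothetical values chosen for illustration.

```python
import math

def loopy_bp(unary, edges, coupling=1.0, iters=50):
    """Approximate marginals via synchronous sum-product loopy BP.

    unary:    dict region -> [log-score for label 0, log-score for label 1]
    edges:    list of undirected (u, v) pairs, each pair listed once
    coupling: log-potential added when two linked regions share a label
    """
    neighbors = {n: [] for n in unary}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # msg[(u, v)][x] is the message from u to v about v taking label x.
    msg = {}
    for u, v in edges:
        msg[(u, v)] = [0.5, 0.5]
        msg[(v, u)] = [0.5, 0.5]

    for _ in range(iters):
        new = {}
        for (u, v) in msg:
            out = [0.0, 0.0]
            for xv in (0, 1):
                for xu in (0, 1):
                    agree = coupling if xu == xv else 0.0
                    term = math.exp(unary[u][xu] + agree)
                    for w in neighbors[u]:
                        if w != v:
                            term *= msg[(w, u)][xu]
                    out[xv] += term
            z = out[0] + out[1]
            new[(u, v)] = [out[0] / z, out[1] / z]
        msg = new

    beliefs = {}
    for n in unary:
        b = [math.exp(unary[n][x]) for x in (0, 1)]
        for w in neighbors[n]:
            for x in (0, 1):
                b[x] *= msg[(w, n)][x]
        z = b[0] + b[1]
        beliefs[n] = [b[0] / z, b[1] / z]
    return beliefs

# Hypothetical local scores: "para2" looks slightly like boilerplate in isolation.
unary = {
    "title":  [-1.0, 1.0],
    "para1":  [-1.5, 1.5],
    "para2":  [0.2, -0.2],
    "footer": [1.0, -1.0],
}
# Hypothetical dependencies: DOM siblings plus one layout-derived link.
dom_edges = [("title", "para1"), ("para1", "para2"), ("para2", "footer")]
layout_edges = [("title", "para2")]  # closes a cycle, so the graph is loopy
edges = sorted(set(dom_edges) | set(layout_edges))

independent = loopy_bp(unary, edges, coupling=0.0)  # local features only
beliefs = loopy_bp(unary, edges, coupling=1.0)      # with interdependencies
print(independent["para2"], beliefs["para2"])
```

With `coupling=0.0` every region is labeled from its local score alone, so `para2` leans towards boilerplate. With the couplings switched on, its links to `title` and `para1` pull its belief towards content. A learnt model would replace the single shared `coupling` with weights over relational features.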