Information extraction from HTML: application of a general machine learning approach
AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Generating finite-state transducers for semi-structured data extraction from the Web
Information Systems - Special issue on semistructured data
Information Systems - Special issue on semistructured data
Learning Information Extraction Rules for Semi-Structured and Free Text
Machine Learning - Special issue on natural language learning
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
Building intelligent web applications using lightweight wrappers
Data & Knowledge Engineering - Special issue on heterogeneous information resources need semantic access
DEByE - Date extraction by example
Data & Knowledge Engineering
WebOQL: Restructuring Documents, Databases, and Webs
ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Data extraction and label assignment for web databases
WWW '03 Proceedings of the 12th international conference on World Wide Web
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
OLERA: Semisupervised Web-Data Extraction with Visual Support
IEEE Intelligent Systems
Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web
Thresher: automating the unwrapping of semantic content from the World Wide Web
WWW '05 Proceedings of the 14th international conference on World Wide Web
The volume and evolution of web page templates
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Content Code Blurring: A New Approach to Content Extraction
DEXA '08 Proceedings of the 2008 19th International Conference on Database and Expert Systems Application
Can we learn a template-independent wrapper for news article extraction from a single training site?
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Template-independent news extraction based on visual consistency
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
CETR: content extraction via tag ratios
Proceedings of the 19th international conference on World wide web
News Filtering and Summarization on the Web
IEEE Intelligent Systems
Web-scale information extraction with vertex
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Personalized News Filtering and Summarization on the Web
ICTAI '11 Proceedings of the 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence
NET – a system for extracting web data from flat and nested data records
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Hi-index | 0.00 |
In addition to the news content, most web news pages also contain navigation panels, advertisements, related news links etc. These non-news items not only exist outside the news region, but are also present in the news content region. Effectively extracting the news content and filtering the noise have important effects on the follow-up activities of content management and analysis. Our extensive case studies have indicated that there exists potential relevance between web content layouts and their tag paths. Based on this observation, we design two tag path features to measure the importance of nodes: Text to tag Path Ratio (TPR) and Extended Text to tag Path Ratio (ETPR), and describe the calculation process of TPR by traversing the parsing tree of a web news page. In this paper, we present Content Extraction via Path Ratios (CEPR) - a fast, accurate and general on-line method for distinguishing news content from non-news content by the TPR/ETPR histogram effectively. In order to improve the ability of CEPR in extracting short texts, we propose a Gaussian smoothing method weighted by a tag path edit distance. This approach can enhance the importance of internal-link nodes but ignore noise nodes existing in news content. Experimental results on the CleanEval datasets and web news pages randomly selected from well-known websites show that CEPR can extract across multi-resources, multi-styles, and multi-languages. The average F and average score with CEPR is 8.69% and 14.25% higher than CETR, which demonstrates better web news extraction performance than most existing methods.