Hybrid method for automated news content extraction from the web

Authors:
Yu Li;Xiaofeng Meng;Qing Li;Liping Wang
Affiliations:
School of Information, Renmin Univ. of China, China;School of Information, Renmin Univ. of China, China;Computer Science Dept., City Univ. of Hong Kong, HKSAR, China;Computer Science Dept., City Univ. of Hong Kong, HKSAR, China
Venue:
WISE'06 Proceedings of the 7th international conference on Web Information Systems
Year:
2006

Abstract

Web news content extraction is vital for improving news indexing and search in today's search engines, especially for dedicated news search services. In this paper we study the Web news content extraction problem and propose an automated extraction algorithm for it. Our method is a hybrid that takes advantage of both sequence matching and tree matching techniques. We propose TSReC, a variant of tag sequence representation suitable for both sequence matching and tree matching, along with an associated algorithm for automated Web news content extraction. We implemented a prototype system for Web news content extraction and evaluated it empirically; the results show that our method is highly effective and efficient.
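
As a rough illustration of the sequence-matching idea only, the sketch below flattens two pages that share a template into tag sequences and reports the text that differs between them, on the assumption that template parts (menus, footers) repeat while the news content varies. This is not TSReC, whose definition the abstract does not give, and it does no tree matching; the names `TagSequencer`, `tag_sequence`, and `varying_text` are hypothetical and chosen only for this example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequencer(HTMLParser):
    """Flatten an HTML page into a list of (kind, value) tokens.

    Hypothetical helper for illustration; not the paper's TSReC.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored so that only the tag structure is compared.
        self.tokens.append(("open", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text nodes
            self.tokens.append(("text", text))


def tag_sequence(html):
    """Return the token sequence for one HTML string."""
    parser = TagSequencer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def varying_text(page_a, page_b):
    """Return the text tokens of page_a that have no match in page_b.

    Assumption: both pages come from the same site template, so
    unmatched text is a candidate for page-specific content.
    """
    seq_a, seq_b = tag_sequence(page_a), tag_sequence(page_b)
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    varying = []
    for op, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        if op != "equal":
            varying.extend(
                value for kind, value in seq_a[a_start:a_end] if kind == "text"
            )
    return varying
```

Calling `varying_text` on two article pages from one site would return the headline and body text of the first page while skipping shared navigation and footer text. A pure sequence comparison like this ignores nesting, which is presumably one reason the paper combines it with tree matching.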