A heuristic approach for topical information extraction from news pages

Authors:
Yan Liu;Qiang Wang;QingXian Wang
Affiliations:
Information Engineering Institute, Information Engineering University, Zhengzhou, P.R. China;Information Engineering Institute, Information Engineering University, Zhengzhou, P.R. China;Information Engineering Institute, Information Engineering University, Zhengzhou, P.R. China
Venue:
WISE'06 Proceedings of the 7th international conference on Web Information Systems
Year:
2006

References (4)

Discovering informative content blocks from Web documents
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

DOM-based content extraction of HTML documents
WWW '03 Proceedings of the 12th international conference on World Wide Web

Learning block importance models for web pages
Proceedings of the 13th international conference on World Wide Web

Automatic extraction of informative blocks from webpages
Proceedings of the 2005 ACM symposium on Applied computing

Cited by (1)

A fast and simple method for extracting relevant content from news webpages
Proceedings of the 18th ACM conference on Information and knowledge management

Abstract

Topical information extraction from news pages can facilitate news search and retrieval. A web page can be partitioned into multiple blocks whose importance varies, so estimating block importance can be framed as a classification problem. First, an adaptive vision-based page segmentation algorithm partitions a web page into semantic blocks. Spatial features and content features are then used to represent each block, and Shannon's information entropy is adopted to quantify how well each feature discriminates between important and unimportant blocks. A weighted Naïve Bayes classifier estimates whether a block is important. Finally, a variation of TF-IDF is used to weight each keyword, and similar blocks are merged into a topical region. The approach is tested on several important English and Chinese news sites; both recall and precision exceed 96%.
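
The abstract does not give the paper's formulas, so the following is only a minimal sketch of how an entropy-weighted Naïve Bayes block classifier might look. It assumes discrete block features, uses information gain (the reduction in Shannon entropy of the class label) as each feature's weight, and applies Laplace smoothing. The feature names, the toy data, and the class `WeightedNaiveBayes` are hypothetical illustrations, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in label entropy after splitting on one discrete feature."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

class WeightedNaiveBayes:
    """Naive Bayes where each feature's log-likelihood is scaled by a weight.

    Assumption: the weight is the feature's information gain; the paper may
    define the entropy-based weight differently.
    """

    def fit(self, rows, labels):
        n_features = len(rows[0])
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.total = len(labels)
        self.weights = [
            information_gain([r[i] for r in rows], labels) for i in range(n_features)
        ]
        self.value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
        self.domains = [set() for _ in range(n_features)]
        for row, y in zip(rows, labels):
            for i, v in enumerate(row):
                self.value_counts[(i, y)][v] += 1
                self.domains[i].add(v)
        return self

    def predict(self, row):
        def score(c):
            s = math.log(self.class_counts[c] / self.total)  # log prior
            for i, v in enumerate(row):
                # Laplace-smoothed estimate of P(feature_i = v | class = c)
                p = (self.value_counts[(i, c)][v] + 1) / (
                    self.class_counts[c] + len(self.domains[i])
                )
                s += self.weights[i] * math.log(p)  # weighted log-likelihood
            return s
        return max(self.classes, key=score)

# Hypothetical toy blocks described by (position, link density).
rows = [
    ("center", "low"), ("center", "low"), ("center", "high"), ("side", "high"),
    ("side", "high"), ("top", "high"), ("bottom", "high"), ("center", "low"),
]
labels = ["important", "important", "noise", "noise",
          "noise", "noise", "noise", "important"]

model = WeightedNaiveBayes().fit(rows, labels)
print(model.weights)
print(model.predict(("center", "low")))
print(model.predict(("side", "high")))
```

In this toy data, link density separates the two classes perfectly, so it receives a larger weight than position and contributes more to the final score. Real spatial and content features would first need to be discretized (or modeled with continuous densities) before a scheme like this could be applied.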