A heuristic approach for topical information extraction from news pages

  • Authors:
  • Yan Liu; Qiang Wang; QingXian Wang

  • Affiliations:
  • Information Engineering Institute, Information Engineering University, Zhengzhou, P.R. China (all authors)

  • Venue:
  • WISE'06 Proceedings of the 7th international conference on Web Information Systems
  • Year:
  • 2006

Abstract

Extracting topical information from news pages can facilitate news search and retrieval. A web page can be partitioned into multiple blocks, and these blocks differ in importance, so estimating block importance can be framed as a classification problem. First, an adaptive vision-based page segmentation algorithm partitions a web page into semantic blocks. Spatial features and content features are then used to represent each block, and Shannon's information entropy measures each feature's discriminative ability. A weighted Naïve Bayes classifier estimates whether each block is important. Finally, a variation of TF-IDF is used to weight each keyword, and similar blocks are merged into a topical region. The approach is tested on several important English and Chinese news sites; both recall and precision exceed 96%.
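
The entropy-weighted classification step can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so this example assumes each feature's weight is its information gain (the reduction in Shannon entropy of the block label), applied as an exponent on that feature's likelihood in a Naive Bayes classifier. The feature names (`position`, `link_density`), the labels, and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting on one discrete feature."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def train(rows, labels):
    """Collect counts and per-feature weights (assumed: information gain)."""
    n_features = len(rows[0])
    weights = [
        information_gain([r[j] for r in rows], labels) for j in range(n_features)
    ]
    class_counts = Counter(labels)
    value_counts = Counter()  # (feature index, value, class) -> count
    domains = [set() for _ in range(n_features)]
    for r, y in zip(rows, labels):
        for j, v in enumerate(r):
            value_counts[(j, v, y)] += 1
            domains[j].add(v)
    return weights, class_counts, value_counts, domains


def predict(model, row):
    """Return the class maximizing log P(c) + sum_j w_j * log P(x_j | c)."""
    weights, class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for j, v in enumerate(row):
            # Laplace smoothing keeps unseen values from zeroing the score.
            p = (value_counts[(j, v, c)] + 1) / (n_c + len(domains[j]))
            score += weights[j] * math.log(p)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical blocks described by (position, link_density).
rows = [
    ("center", "low"), ("center", "low"), ("center", "high"),
    ("side", "high"), ("side", "high"), ("side", "low"),
]
labels = ["important"] * 3 + ["noise"] * 3

model = train(rows, labels)
print(predict(model, ("center", "high")))
```

In this toy data, `position` separates the two classes perfectly while `link_density` is only weakly informative, so the position feature dominates the decision. A real system would compute such features from the segmented page blocks rather than from hand-written tuples.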