Theme Extraction from Chinese Web Documents Based on Page Segmentation and Entropy

  • Authors:
  • Deqing Wang;Hui Zhang;Gang Zhou

  • Affiliations:
  • State Key Lab of Software Development Environment, Beihang University, Beijing, P.R. China 100191;State Key Lab of Software Development Environment, Beihang University, Beijing, P.R. China 100191;State Key Lab of Software Development Environment, Beihang University, Beijing, P.R. China 100191

  • Venue:
  • ISMIS '09 Proceedings of the 18th International Symposium on Foundations of Intelligent Systems
  • Year:
  • 2009

Quantified Score

Hi-index 0.01

Abstract

Web pages often contain "clutter" (defined by us as unnecessary images, navigational menus and extraneous ad links) around the body of an article, which may distract users from the actual content. How to extract useful and relevant themes from such pages has therefore become a research focus. This paper proposes a new method for web theme extraction. The method first uses a page segmentation technique to divide a web page into many unrelated blocks, then calculates the entropy of each block and of the entire web page, then prunes redundant blocks whose entropies are larger than the threshold of the web page, and finally exports the remaining blocks as the theme of the web page. Experiments verify that the new method achieves better results on theme extraction from Chinese web pages.
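
The pruning step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how entropy is defined or how the threshold is derived, so everything below is an assumption made for illustration: entropy is taken as Shannon entropy over character frequencies, the threshold defaults to the entropy of the whole page, and the page is assumed to be already segmented into text blocks. The names `shannon_entropy`, `tokenize` and `extract_theme` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the frequency distribution of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def tokenize(text):
    # Assumption: character-level tokens with whitespace dropped, a crude
    # stand-in for proper Chinese word segmentation.
    return [ch for ch in text if not ch.isspace()]


def extract_theme(blocks, threshold=None):
    """Keep the blocks whose entropy does not exceed the threshold.

    blocks: list of strings, assumed to come from a page segmentation step.
    threshold: assumed here to default to the entropy of the whole page;
    the paper's actual threshold rule is not given in the abstract.
    """
    block_tokens = [tokenize(b) for b in blocks]
    if threshold is None:
        page_tokens = [t for toks in block_tokens for t in toks]
        threshold = shannon_entropy(page_tokens)
    # Prune blocks whose entropy is larger than the threshold.
    return [
        block
        for block, toks in zip(blocks, block_tokens)
        if shannon_entropy(toks) <= threshold
    ]
```

Calling `extract_theme(blocks)` on a list of block strings returns the retained blocks in their original order. A real system would replace `tokenize` with a proper word segmenter and would use whatever entropy definition and threshold the full paper specifies.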