Theme Extraction from Chinese Web Documents Based on Page Segmentation and Entropy

Authors:
Deqing Wang;Hui Zhang;Gang Zhou
Affiliations:
State Key Lab of Software Development Environment, Beihang University, Beijing, P.R. China 100191;State Key Lab of Software Development Environment, Beihang University, Beijing, P.R. China 100191;State Key Lab of Software Development Environment, Beihang University, Beijing, P.R. China 100191
Venue:
ISMIS '09 Proceedings of the 18th International Symposium on Foundations of Intelligent Systems
Year:
2009

Abstract

Web pages often contain "clutter" (which we define as unnecessary images, navigational menus, and extraneous ad links) around the body of an article, and this clutter may distract users from the actual content. Extracting the useful and relevant theme from such pages has therefore become a research focus. This paper proposes a new method for web theme extraction. The method first uses a page segmentation technique to divide a web page into many unrelated blocks, then calculates the entropy of each block and of the entire web page, prunes redundant blocks whose entropies are larger than the threshold of the web page, and finally outputs the remaining blocks as the theme of the web page. Experiments verify that the new method performs better on theme extraction from Chinese web pages.
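
The pruning step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the entropy formula, the segmentation algorithm, or how the page threshold is derived, so everything below is an assumption made for illustration: entropy is taken to be the Shannon entropy of a block's token frequencies, blocks are assumed to arrive already segmented and tokenized (for Chinese text, by some word segmenter), and the threshold defaults to the entropy of the whole page. The names `shannon_entropy` and `extract_theme_blocks` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the token frequency distribution."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_theme_blocks(blocks, threshold=None):
    """Keep the blocks whose entropy does not exceed the threshold.

    blocks: list of token lists, one per segmented page block (assumed input).
    threshold: assumed to default to the entropy of the whole page; the
    paper's actual threshold rule is not stated in the abstract.
    """
    if threshold is None:
        page_tokens = [tok for block in blocks for tok in block]
        threshold = shannon_entropy(page_tokens)
    # Prune blocks whose entropy is larger than the threshold.
    return [b for b in blocks if shannon_entropy(b) <= threshold]


if __name__ == "__main__":
    # Toy blocks: a repetitive "article" block and a varied "menu" block.
    article = ["entropy", "entropy", "entropy", "block"]
    menu = ["home", "news", "sports", "login", "ads", "help", "about", "map"]
    kept = extract_theme_blocks([article, menu], threshold=2.0)
    print(kept)
```

With the explicit threshold of 2.0 bits used here, only the repetitive block is kept. With the default page-entropy threshold, both toy blocks would survive, which suggests that the choice of threshold matters a great deal in practice.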