Entropy-Based Visual Tree Evaluation on Block Extraction

Authors:
Wei-Ting Cho;Yu-Min Lin;Hung-Yu Kao
Affiliations:
-;-;-
Venue:
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2009

Abstract

More and more people use Cascading Style Sheets (CSS) to manage their Web pages because CSS makes typesetting easy and convenient. However, CSS can leave the displayed Web page with an ambiguous structure. Data extraction systems that rely on mining Web page structure therefore make false judgments on these CSS-rich pages. To address this issue, we propose a system that applies properties of CSS Web pages to extract data blocks. In this system, Web pages are converted into a visual tree, and the entropy attributes of each node in the visual tree are calculated. Experimental results show that the node attributes and the visual tree are useful for extracting blocks from CSS Web pages. Our system also outperforms other systems on container block extraction.
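
The abstract does not define the entropy attributes, so the following is only an illustrative sketch of what annotating visual-tree nodes with an entropy value might look like. It assumes, purely for illustration, that a node's attribute is the Shannon entropy of the term distribution in its subtree. The `VisualNode` class, the function names, and the sample page are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class VisualNode:
    """Hypothetical visual-tree node: one rendered block plus its child blocks."""
    label: str
    text: str = ""
    children: list = field(default_factory=list)


def subtree_terms(node):
    """Count whitespace-separated terms in this node and all of its descendants."""
    terms = Counter(node.text.lower().split())
    for child in node.children:
        terms.update(subtree_terms(child))
    return terms


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a term-frequency distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def annotate_entropy(node, out=None):
    """Map every node label to the entropy of its subtree (assumed attribute)."""
    if out is None:
        out = {}
    out[node.label] = shannon_entropy(subtree_terms(node))
    for child in node.children:
        annotate_entropy(child, out)
    return out


# Toy page: a repetitive navigation block and a more varied content block.
nav = VisualNode("nav", "home home home home")
article = VisualNode("article", "entropy scores rank content blocks")
page = VisualNode("page", "", [nav, article])

scores = annotate_entropy(page)
for label, value in scores.items():
    print(f"{label}: {value:.3f}")
```

In this toy example the repetitive navigation block gets a lower entropy than the varied content block, which is the kind of signal a block extractor could use to separate the two. How the actual system defines and combines its node attributes is not stated in the abstract.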