Cleaning web pages for effective web content mining

Authors:
Jing Li;C. I. Ezeife
Affiliations:
School of Computer Science, University of Windsor, Windsor, Ontario, Canada;School of Computer Science, University of Windsor, Windsor, Ontario, Canada
Venue:
DEXA'06 Proceedings of the 17th international conference on Database and Expert Systems Applications
Year:
2006

Citing 7
Cited 3

The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Template detection via data mining and its applications

Proceedings of the 11th international conference on World Wide Web
Discovering informative content blocks from Web documents

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Eliminating noisy information in Web pages for data mining

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning block importance models for web pages

Proceedings of the 13th international conference on World Wide Web
Web-page classification through summarization

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Extracting content structure for web pages based on visual representation

APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications

Effectiveness of template detection on noise reduction and websites summarization

Information Sciences: an International Journal
Towards Comparative Mining of Web Document Objects with NFA: WebOMiner System

International Journal of Data Warehousing and Mining
Rhetorics-based multi-document summarization

Expert Systems with Applications: An International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Classifying and mining noise-free web pages will improve on accuracy of search results as well as search speed, and may benefit web-page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on web pages are irrelevant to the main content on the web pages being mined, and include advertisements, navigation bar, and copyright notices. The few existing work on web page cleaning detect noise blocks with exact matching contents but are weak at detecting near duplicate blocks, characterized by items like navigation bars. This paper proposes a system, WebPageCleaner, for eliminating noise blocks from web pages for purposes of improving the accuracy and efficiency of web content mining. A vision-based technique is employed for extracting blocks from web pages. Then, relevant web page blocks are identified as those with high importance level by analyzing such physical features of the blocks as the block location, percentage of web links on the block, and level of similarity of block contents to other blocks. Important blocks are exported to be used for web content mining using Naive Bayes text classification. Experiments show that WebPageCleaner leads to a more accurate and efficient web page classification results than comparable existing approaches.