Web page cleaning for web mining through feature weighting

Authors:
Lan Yi;Bing Liu
Affiliations:
School of Computing, National University of Singapore, Singapore;Department of Computer Science, University of Illinois at Chicago, Chicago, IL
Venue:
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Year:
2003

Citing 17
Cited 19

Subtopic structuring for full-length document access

SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Making large-scale support vector machine learning practical

Advances in kernel methods
Learning to remove Internet advertisements

Proceedings of the third annual conference on Autonomous Agents
Statistical Models for Text Segmentation

Machine Learning - Special issue on natural language learning
Authoritative sources in a hyperlinked environment

Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms
IntelliClean: a knowledge-based intelligent data cleaner

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Enhanced topic distillation using text, markup tags, and hyperlinks

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Template detection via data mining and its applications

Proceedings of the 11th international conference on World Wide Web
Modern Information Retrieval

Modern Information Retrieval
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Entropy-based link analysis for mining web informative structures

Proceedings of the eleventh international conference on Information and knowledge management
Text Categorization with Suport Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Discovering informative content blocks from Web documents

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
A model of lexical attraction and repulsion

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Cohesion and collocation: using context vectors in text segmentation

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Statistical models for topic segmentation

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics

Learning block importance models for web pages

Proceedings of the 13th international conference on World Wide Web
Using link analysis to improve layout on mobile devices

Proceedings of the 13th international conference on World Wide Web
Editorial: special issue on web content mining

ACM SIGKDD Explorations Newsletter
Learning important models for web page blocks based on layout and content analysis

ACM SIGKDD Explorations Newsletter
Browsing fatigue in handhelds: semantic bookmarking spells relief

WWW '05 Proceedings of the 14th international conference on World Wide Web
The volume and evolution of web page templates

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Mining key information of web pages: A method and its application

Expert Systems with Applications: An International Journal
Two-phase Web site classification based on Hidden Markov Tree models

Web Intelligence and Agent Systems
Page-level template detection via isotonic smoothing

Proceedings of the 16th international conference on World Wide Web
Enhancing web page classification through image-block importance analysis

Information Processing and Management: an International Journal
Site-Independent Template-Block Detection

PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases
Web Contents Extracting for Web-Based Learning

ICWL '08 Proceedings of the 7th international conference on Advances in Web Based Learning
Automated Semantic Analysis of Schematic Data

World Wide Web
Extracting article text from the web with maximum subsequence segmentation

Proceedings of the 18th international conference on World wide web
A fast and simple method for extracting relevant content from news webpages

Proceedings of the 18th ACM conference on Information and knowledge management
Finding and using the content texts of HTML pages

AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Automatic selection of print-worthy content for enhanced web page printing experience

Proceedings of the 10th ACM symposium on Document engineering
On text preprocessing for opinion mining outside of laboratory environments

AMT'12 Proceedings of the 8th international conference on Active Media Technology
A hybrid approach for extracting informative content from web pages

Information Processing and Management: an International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Unlike conventional data or text, Web pages typically contain a large amount of information that is not part of the main contents of the pages, e.g., banner ads, navigation bars, and copyright notices. Such irrelevant information (which we call Web page noise) in Web pages can seriously harm Web mining, e.g., clustering and classification. In this paper, we propose a novel feature weighting technique to deal with Web page noise to enhance Web mining. This method first builds a compressed structure tree to capture the common structure and comparable blocks in a set of Web pages. It then uses an information based measure to evaluate the importance of each node in the compressed structure tree. Based on the tree and its node importance values, our method assigns a weight to each word feature in its content block. The resulting weights are used in Web mining. We evaluated the proposed technique with two Web mining tasks, Web page clustering and Web page classification. Experimental results show that our weighting method is able to dramatically improve the mining results.