Eliminating noisy information in Web pages for data mining

Authors:
Lan Yi;Bing Liu;Xiaoli Li
Affiliations:
National University of Singapore, Singapore;University of Illinois at Chicago, Chicago, IL;National University of Singapore, Singapore
Venue:
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2003

Abstract

A commercial Web page typically contains many information blocks. Apart from the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). We call the blocks that are not the main content blocks of the page noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining, so eliminating this noise is of great importance. In this paper, we propose a noise elimination technique based on the following observation: within a given Web site, noisy blocks usually share common contents and presentation styles, while the main content blocks of the pages are often diverse in their actual contents and/or presentation styles. Based on this observation, we propose a tree structure, called the Style Tree, to capture the common presentation styles and the actual contents of the pages in a given Web site. By sampling the pages of the site, a Style Tree can be built for the site, which we call the Site Style Tree (SST). We then introduce an information-based measure to determine which parts of the SST represent noise and which parts represent the main contents of the site. The SST is then used to detect and eliminate noise in any Web page of the site by mapping that page to the SST. The proposed technique is evaluated on two data mining tasks, Web page clustering and classification. Experimental results show that our noise elimination technique improves the mining results significantly.
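
The observation behind the technique (blocks repeated across a site's pages tend to be noise, while blocks that vary tend to be main content) can be illustrated with a minimal Python sketch. This is not the Style Tree or the paper's measure: it flattens each page into a dictionary keyed by a hypothetical tag-path string and uses a plain normalized entropy as a stand-in for the information-based measure. The names `block_diversity`, `clean_page`, the sample tag paths, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def block_diversity(pages):
    """Score each block position by how varied its content is across pages.

    `pages` is a list of dicts mapping a block position (here a hypothetical
    tag-path string) to the block's text. Returns a dict mapping each position
    to a normalized entropy in [0, 1]: 0 means every page that has the block
    carries identical content there, 1 means every sampled page differs.
    """
    m = len(pages)
    positions = set()
    for page in pages:
        positions.update(page)
    if m < 2:
        # A single page gives no evidence of repetition: treat all blocks as content.
        return {pos: 1.0 for pos in positions}
    scores = {}
    for pos in positions:
        counts = Counter(page[pos] for page in pages if pos in page)
        total = sum(counts.values())
        if total < 2:
            # A block seen on only one page is unique to it, so keep it as content.
            scores[pos] = 1.0
            continue
        # Entropy of the content distribution, with log base m so the maximum is 1.
        entropy = -sum((c / total) * math.log(c / total, m) for c in counts.values())
        scores[pos] = max(0.0, entropy)
    return scores


def clean_page(page, scores, threshold=0.5):
    """Keep only the blocks whose diversity score reaches the (assumed) threshold."""
    return {pos: text for pos, text in page.items()
            if scores.get(pos, 1.0) >= threshold}


# Toy sample of three pages from one hypothetical site.
sample_pages = [
    {"body/div.nav": "Home | Products | Contact", "body/div.main": "Review of camera A"},
    {"body/div.nav": "Home | Products | Contact", "body/div.main": "Review of laptop B"},
    {"body/div.nav": "Home | Products | Contact", "body/div.main": "Review of phone C"},
]
scores = block_diversity(sample_pages)
cleaned = clean_page(sample_pages[0], scores)  # the shared navigation block is dropped
```

In this toy sample the navigation block is identical on every page and is removed, while the review text differs on every page and is kept. A real implementation along the lines of the abstract would operate on a tree built from the pages' markup and would also account for presentation style, which this flat sketch ignores.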