Mining web content outliers using structure oriented weighting techniques and N-grams

  • Authors:
  • Malik Agyemang;Ken Barker;Reda S. Alhajj

  • Affiliations:
  • University of Calgary, AB, Canada;University of Calgary, AB, Canada;University of Calgary, AB, Canada

  • Venue:
  • Proceedings of the 2005 ACM symposium on Applied computing
  • Year:
  • 2005


Abstract

Classifying text into predefined categories is a fundamental task in information retrieval (IR). IR and web mining techniques have been applied to categorize web pages, enabling users to manage and use the huge amount of information available on the web. Consequently, user-friendly and automated tools for managing web information are in high demand in the web mining and information retrieval communities. Text categorization, information routing, identification of junk material, topic identification, and structured search are among the hot spots in web information management. Many techniques exist for classifying web documents into categories. Interestingly, almost none of the existing algorithms consider documents whose contents vary from the rest of the documents taken from the same domain (category), called web content outliers. In this paper, we take advantage of the HTML structure of web pages and the n-gram technique for partial matching of strings, and propose an n-gram-based algorithm for mining web content outliers. To reduce processing time, the optimized algorithm uses only data captured in `<Title>` and `<Meta>` tags. Experimental results using planted motifs indicate that the proposed n-gram-based algorithm is capable of finding web content outliers. In addition, using text captured in `<Title>` and `<Meta>` tags gave the same results as using text embedded in `<Title>`, `<Meta>`, and `<Body>` tags.
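The abstract's core idea, scoring each document's n-gram overlap against a profile built from the rest of its category and flagging the least similar documents as content outliers, can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the Jaccard-style dissimilarity, the leave-one-out profile, and all function names are assumptions standing in for the paper's structure-oriented weighting scheme.

```python
from collections import Counter

def ngrams(text, n=3):
    """Character n-grams of a string (lowercased, whitespace collapsed)."""
    t = " ".join(text.lower().split())
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def dissimilarity(doc_grams, profile_grams):
    """1 - Jaccard overlap between a document's n-gram set and the profile's.
    (Illustrative choice; the paper uses its own weighted measure.)"""
    doc, prof = set(doc_grams), set(profile_grams)
    if not doc or not prof:
        return 1.0
    return 1.0 - len(doc & prof) / len(doc | prof)

def find_outliers(documents, n=3, top_k=1):
    """Score each document against a profile of all *other* documents in the
    category; the highest-scoring documents are candidate content outliers."""
    grams = [ngrams(d, n) for d in documents]
    scores = []
    for i, g in enumerate(grams):
        profile = Counter()  # leave-one-out category profile
        for j, other in enumerate(grams):
            if j != i:
                profile.update(other)
        scores.append((dissimilarity(g, profile), i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]
```

In this sketch each document is matched partially through shared character n-grams rather than exact word matches, which is why misspellings or morphological variants still contribute overlap. In the paper's optimized variant, the input strings would be only the text captured in the relevant HTML tags rather than the full page.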