International Journal of Computational Science and Engineering
Classifying text into predefined categories is a fundamental task in information retrieval (IR). IR and web mining techniques have been applied to categorize web pages so that users can manage and use the huge amount of information available on the web. As a result, user-friendly, automated tools for managing web information are in increasing demand in the web mining and information retrieval communities. Text categorization, information routing, identification of junk materials, topic identification and structured search are some of the most active topics in web information management. Many techniques exist for classifying web documents into categories. Interestingly, almost none of the existing algorithms consider web content outliers: documents whose contents vary from those of the other documents taken from the same domain (category). In this paper, we take advantage of the HTML structure of the web and the n-gram technique for partial matching of strings, and propose an n-gram-based algorithm for mining web content outliers. To reduce processing time, the optimized algorithm uses only the data captured in two HTML tags. Experimental results using planted motifs indicate that the proposed n-gram-based algorithm is capable of finding web content outliers. In addition, using the text captured in those two tags gave the same results as using the text embedded in a set of three tags.
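The idea of n-gram partial matching for outlier detection can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper, whose details the abstract does not give. It assumes character trigrams, plain-text input (no HTML tag extraction), and a simple overlap score; the names `char_ngrams`, `ngram_overlap` and `rank_outliers` are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    # Texts shorter than n characters yield an empty Counter.
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def ngram_overlap(doc, reference_docs, n=3):
    """Fraction of the document's n-grams that occur in any reference document.

    Assumed scoring rule for illustration only: partial matches such as
    "booking" vs. "bookings" still share most of their n-grams.
    """
    doc_grams = char_ngrams(doc, n)
    total = sum(doc_grams.values())
    if total == 0:
        return 0.0
    reference = set()
    for ref in reference_docs:
        reference.update(char_ngrams(ref, n))
    matched = sum(count for gram, count in doc_grams.items() if gram in reference)
    return matched / total


def rank_outliers(docs, n=3):
    """Return document indices ordered from most to least outlying.

    Each document is scored against all other documents in the same
    category; a low overlap suggests content that varies from the rest.
    """
    scores = []
    for i, doc in enumerate(docs):
        others = [d for j, d in enumerate(docs) if j != i]
        scores.append((ngram_overlap(doc, others, n), i))
    return [i for _, i in sorted(scores)]


if __name__ == "__main__":
    docs = [
        "cheap flights and hotel booking",
        "hotel bookings and cheap flight deals",
        "flight deals, cheap hotels, booking",
        "quantum chromodynamics lattice gauge theory",
    ]
    print(rank_outliers(docs))
```

In this toy category the last document shares almost no trigrams with the others, so it is ranked first. A fuller version would restrict the input to text taken from selected HTML tags, as the abstract describes.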