Hybrid approach to web content outlier mining without query vector

Authors:
Malik Agyemang;Ken Barker;Reda Alhajj
Affiliations:
Department of Computer Science, University of Calgary, Calgary, Alberta, Canada;Department of Computer Science, University of Calgary, Calgary, Alberta, Canada;Department of Computer Science, University of Calgary, Calgary, Alberta, Canada
Venue:
DaWaK'05 Proceedings of the 7th international conference on Data Warehousing and Knowledge Discovery
Year:
2005

Abstract

Mining outliers from large datasets is like finding needles in a haystack. Even more challenging is sifting through dynamic, unstructured, and ever-growing web data for outliers. This paper presents HyCOQ, a hybrid algorithm that draws on the strengths of both n-gram-based and word-based systems. Experimental results obtained using embedded motifs without a dictionary show significant improvement over using a domain dictionary, irrespective of the type of data used (words, n-grams, or hybrid). There is also a remarkable improvement in recall with hybrid documents compared with using raw words or n-grams without a domain dictionary.
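
The abstract does not describe HyCOQ's internals, so the following is only an illustrative sketch of what a hybrid word/n-gram document representation might look like, not the authors' algorithm. The function names, the choice of character trigrams, the "w:"/"g:" prefixes, and the Jaccard-based dissimilarity are all assumptions made for this example.

```python
import re
from collections import Counter


def word_tokens(text):
    """Lowercase alphabetic word tokens (assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def char_ngrams(text, n=3):
    """Character n-grams taken inside each word, never across words."""
    grams = []
    for word in word_tokens(text):
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def hybrid_profile(text, n=3):
    """Count whole words and their n-grams together (hypothetical hybrid).

    Prefixes keep a word such as "min" distinct from the n-gram "min".
    """
    profile = Counter("w:" + w for w in word_tokens(text))
    profile.update("g:" + g for g in char_ngrams(text, n))
    return profile


def dissimilarity(a, b):
    """1 minus the Jaccard overlap of two profiles' feature sets."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return 1.0 - len(set(a) & set(b)) / len(union)
```

Under these assumptions, a document whose average dissimilarity to the other documents in its category is unusually high would be a candidate outlier, with no query vector or domain dictionary involved in the comparison.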