Web content outlier mining: motivation, framework, and algorithms

  • Authors:
  • Malik Agyemang

  • Affiliations:
  • University of Calgary (Canada)

  • Venue:
  • Web content outlier mining: motivation, framework, and algorithms
  • Year:
  • 2006

Abstract

Data that differ significantly from the norm are considered outliers. Finding outliers in huge data repositories is akin to finding needles in a haystack, and searching for them in Web data repositories is even more challenging. The data mining community accepts that outliers are present in every data repository, and the Web is no exception. However, there is neither a formal definition of Web outliers nor any known algorithm for mining them. Moreover, existing outlier mining algorithms designed solely for numeric data cannot be applied directly to Web datasets, which contain data of different types (e.g., text, hypertext, video, audio, and images). The thesis establishes the presence of outliers on the Web and provides motivation for mining them. It provides a taxonomy of Web outliers that supports the development of content-specific algorithms for mining them. The thesis discusses a general framework for mining Web outliers but concentrates on designing models for mining Web content outliers. Three algorithms for mining Web content outliers are proposed. The WCOW-Mine algorithm is based on full keyword matching, whereas the WCON-Mine algorithm uses character n-grams for partial matching of strings. The third algorithm, HyCOQ, uses a hybrid of keywords and n-grams. With slight modifications, all three algorithms can run either with or without a domain dictionary. The HyCOQ algorithm eliminates the weaknesses of n-gram-based and keyword-based systems. The experimental results show that all three algorithms are capable of finding Web content outliers. Further, HyCOQ shows huge improvements in accuracy over WCON-Mine and WCOW-Mine with embedded motifs. The results also show that, irrespective of the algorithm, mining Web content outliers without a domain dictionary is more efficient than mining with one.
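
The difference between full keyword matching and partial matching with character n-grams can be illustrated with a minimal Python sketch. It is not the thesis's WCOW-Mine, WCON-Mine, or HyCOQ algorithm; the function names, the scoring formulas, the trigram size, and the toy dictionary are all assumptions made for illustration. A lower score here simply means a document shares less with the domain dictionary, which is one plausible signal of a content outlier.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def keyword_score(doc_words, dictionary):
    """Fraction of document words found verbatim in the dictionary (full matching)."""
    if not doc_words:
        return 0.0
    hits = sum(1 for w in doc_words if w.lower() in dictionary)
    return hits / len(doc_words)


def ngram_score(doc_words, dictionary, n=3):
    """Fraction of document n-grams that also occur in a dictionary word (partial matching)."""
    dict_grams = set()
    for w in dictionary:
        dict_grams |= char_ngrams(w, n)
    doc_grams = set()
    for w in doc_words:
        doc_grams |= char_ngrams(w, n)
    if not doc_grams:
        return 0.0
    return len(doc_grams & dict_grams) / len(doc_grams)


# Hypothetical domain dictionary and two toy "documents" (assumed data).
dictionary = {"mining", "outlier", "cluster"}
on_topic = ["outliers", "mined", "clusters"]   # inflected forms of dictionary terms
off_topic = ["recipe", "butter", "oven"]       # unrelated vocabulary

print("keyword:", keyword_score(on_topic, dictionary), keyword_score(off_topic, dictionary))
print("n-gram: ", ngram_score(on_topic, dictionary), ngram_score(off_topic, dictionary))
```

In this toy setup, full keyword matching cannot tell the two documents apart, because none of the inflected words appear verbatim in the dictionary. The n-gram score separates them, at the cost of occasional accidental overlaps such as the trigram "ter" shared by "butter" and "cluster". Trade-offs of this kind are a plausible motivation for combining the two approaches, as the abstract describes for HyCOQ.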