Mining web content outliers using structure oriented weighting techniques and N-grams

  • Authors:
  • Malik Agyemang;Ken Barker;Reda S. Alhajj

  • Affiliations:
  • University of Calgary, AB, Canada;University of Calgary, AB, Canada;University of Calgary, AB, Canada

  • Venue:
  • Proceedings of the 2005 ACM symposium on Applied computing
  • Year:
  • 2005


Abstract

Classifying text into predefined categories is a fundamental task in information retrieval (IR). IR and web mining techniques have been applied to categorize web pages, enabling users to manage and use the huge amount of information available on the web. Consequently, user-friendly and automated tools for managing web information are in high demand in the web mining and information retrieval communities. Text categorization, information routing, identification of junk material, topic identification, and structured search are among the hot spots in web information management. Many techniques exist for classifying web documents into categories. Interestingly, almost none of the existing algorithms consider documents whose contents vary from the rest of the documents taken from the same domain (category), called web content outliers. In this paper, we take advantage of the HTML structure of web pages and the n-gram technique for partial matching of strings, and propose an n-gram-based algorithm for mining web content outliers. To reduce processing time, the optimized algorithm uses only data captured in `<Title>` and `<Meta>` tags. Experimental results using planted motifs indicate that the proposed n-gram-based algorithm is capable of finding web content outliers. In addition, using text captured in `<Title>` and `<Meta>` tags gave the same results as using text embedded in `<Title>`, `<Meta>`, and `<Body>` tags.
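The abstract's core idea, scoring each document's n-gram overlap against a profile built from the rest of its category and flagging the least similar documents as content outliers, can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the Jaccard-style dissimilarity, the leave-one-out profile, and all function names are assumptions standing in for the paper's structure-oriented weighting scheme.

```python
from collections import Counter

def ngrams(text, n=3):
    """Character n-grams of a string (lowercased, whitespace collapsed)."""
    t = " ".join(text.lower().split())
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def dissimilarity(doc_grams, profile_grams):
    """1 - Jaccard overlap between a document's n-gram set and the profile's.
    (Illustrative choice; the paper uses its own weighted measure.)"""
    doc, prof = set(doc_grams), set(profile_grams)
    if not doc or not prof:
        return 1.0
    return 1.0 - len(doc & prof) / len(doc | prof)

def find_outliers(documents, n=3, top_k=1):
    """Score each document against a profile of all *other* documents in the
    category; the highest-scoring documents are candidate content outliers."""
    grams = [ngrams(d, n) for d in documents]
    scores = []
    for i, g in enumerate(grams):
        profile = Counter()  # leave-one-out category profile
        for j, other in enumerate(grams):
            if j != i:
                profile.update(other)
        scores.append((dissimilarity(g, profile), i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]
```

In this sketch each document is matched partially through shared character n-grams rather than exact word matches, which is why misspellings or morphological variants still contribute overlap. In the paper's optimized variant, the input strings would be only the text captured in the relevant HTML tags rather than the full page.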