Classifying websites into non-topical categories

  • Authors: Chaman Thapa; Osmar Zaiane; Davood Rafiei; Arya M. Sharma
  • Affiliations: University of Alberta, Canada (all four authors)
  • Venue: DaWaK'12 Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery
  • Year: 2012

Abstract

With the large presence of organizations from different sectors of the economy on the web, the problem of detecting which sector a given website belongs to is both important and challenging. In this paper, we study the problem of classifying websites into four non-topical categories: public, private, non-profit and commercial franchise. Our work treats each website and all pages from the site as a single entity and classifies the entire website, as opposed to a single page or a set of pages. We analyze both textual features, including terms, part-of-speech bigrams and named entities, and structural features, including the link structure of the site and URL patterns. Our experiments on a large set of websites related to weight loss and obesity control, under a multi-label classification setting using the SVM classifier, reveal that with a careful selection and treatment of features based on keywords, one can achieve an F-measure of 70%, and that adding structural, part-of-speech and named-entity-based features further improves the F-measure to 74%. The improvement is more significant when textual features are not accurate or sufficient.
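
As a rough illustration of two ideas in the abstract, the sketch below merges all pages of a site into one bag of terms (the "single entity" view) and scores multi-label predictions with an F-measure. It is not the authors' implementation: the function names `site_term_counts` and `micro_f_measure` are hypothetical, the tokenizer is a simple placeholder, and micro-averaging is an assumption, since the abstract does not say how the F-measure was averaged.

```python
import re
from collections import Counter


def site_term_counts(pages):
    """Merge the text of every page of one site into a single bag of terms.

    `pages` is an iterable of page texts belonging to the same website.
    The lowercase-letters tokenizer is a placeholder assumption.
    """
    counts = Counter()
    for text in pages:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def micro_f_measure(true_sets, predicted_sets):
    """Micro-averaged F-measure over per-site label sets.

    Each element is the set of categories assigned to one site; a site
    may carry several labels (multi-label setting).
    """
    tp = fp = fn = 0
    for truth, predicted in zip(true_sets, predicted_sets):
        tp += len(truth & predicted)   # labels correctly predicted
        fp += len(predicted - truth)   # labels predicted but wrong
        fn += len(truth - predicted)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pages = [
        "Our clinic offers weight loss programs",
        "Weight loss clinic franchise locations",
    ]
    print(site_term_counts(pages).most_common(3))

    truth = [{"private"}, {"public", "non-profit"}]
    predicted = [{"private"}, {"public"}]
    print(round(micro_f_measure(truth, predicted), 2))
```

In a fuller pipeline the site-level term counts would feed a classifier such as an SVM, one binary decision per category; that step is left out here to keep the sketch dependency-free.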