Classifying websites into non-topical categories

Authors:
Chaman Thapa;Osmar Zaiane;Davood Rafiei;Arya M. Sharma
Affiliations:
University of Alberta, Canada;University of Alberta, Canada;University of Alberta, Canada;University of Alberta, Canada
Venue:
DaWaK'12 Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery
Year:
2012

Abstract

With organizations from many sectors of the economy now present on the web, detecting the sector to which a given website belongs is both important and challenging. In this paper, we study the problem of classifying websites into four non-topical categories: public, private, non-profit, and commercial franchise. Our work treats each website, together with all of its pages, as a single entity and classifies the entire website rather than a single page or a set of pages. We analyze both textual features (terms, part-of-speech bigrams, and named entities) and structural features (the link structure of the site and URL patterns). Our experiments on a large set of websites related to weight loss and obesity control, conducted in a multi-label classification setting with an SVM classifier, show that careful selection and treatment of keyword-based features yields an F-measure of 70%, and that adding structural, part-of-speech, and named-entity features raises the F-measure to 74%. The improvement is more significant when textual features are not accurate or sufficient.
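
To make the idea of treating a whole website as a single entity more concrete, the following is a minimal sketch of how term counts from all pages of a site might be pooled into one feature dictionary alongside a few simple URL-pattern features. It is not the authors' implementation: the function name `site_features`, the feature names, and the specific URL features (page count, mean path depth, top-level domain) are assumptions chosen for illustration, and the part-of-speech, named-entity, and link-structure features mentioned in the abstract are omitted.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def site_features(pages):
    """Build one feature dict for a whole site.

    `pages` is a list of (url, text) pairs that all belong to the same site.
    Feature names and the choice of URL features are illustrative assumptions.
    """
    terms = Counter()
    depths = []
    tlds = Counter()

    for url, text in pages:
        # Pool lowercase alphabetic tokens from every page into one bag of words.
        terms.update(re.findall(r"[a-z]+", text.lower()))

        parsed = urlparse(url)
        # Path depth: number of non-empty path segments in the URL.
        segments = [s for s in parsed.path.split("/") if s]
        depths.append(len(segments))

        # Top-level domain, e.g. "org" or "com".
        host = parsed.hostname or ""
        tlds[host.rsplit(".", 1)[-1]] += 1

    features = {f"term={t}": c for t, c in terms.items()}
    features["url:page_count"] = len(pages)
    features["url:mean_depth"] = sum(depths) / len(depths) if depths else 0.0
    for tld, count in tlds.items():
        features[f"url:tld={tld}"] = count
    return features
```

A dictionary of this kind could then be vectorized and passed to a multi-label classifier; a common way to handle the multi-label setting is to train one binary SVM per category, though the abstract does not say which scheme was used.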