With organizations from many sectors of the economy now present on the web, detecting the sector to which a given website belongs is both important and challenging. In this paper, we study the problem of classifying websites into four non-topical categories: public, private, non-profit, and commercial franchise. Our work treats a website and all of its pages as a single entity and classifies the entire website, as opposed to a single page or a set of pages. We analyze both textual features, including terms, part-of-speech bigrams, and named entities, and structural features, including the link structure of the site and URL patterns. Our experiments on a large set of websites related to weight loss and obesity control, under a multi-label classification setting using an SVM classifier, reveal that with careful selection and treatment of keyword-based features, one can achieve an F-measure of 70%, and that adding structural, part-of-speech, and named-entity features further improves the F-measure to 74%. The improvement is more significant when textual features are not accurate or sufficient.
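As a rough illustration of two ideas in the abstract, the sketch below pools every page of a site into one feature vector and scores multi-label predictions with an F-measure. It is a minimal sketch under stated assumptions, not the paper's pipeline: the function names `site_features` and `micro_f_measure`, the specific URL features (top-level domain and path depth), and the choice of micro-averaging are all illustrative guesses, since the abstract does not specify them. The SVM training step, part-of-speech bigrams, named entities, and link structure are omitted.

```python
from collections import Counter
from urllib.parse import urlparse
import re

# The four non-topical categories named in the abstract.
LABELS = ("public", "private", "non-profit", "commercial franchise")


def site_features(pages):
    """Build ONE feature vector for a whole site.

    `pages` is a list of (url, text) pairs belonging to the same website.
    The feature names used here are hypothetical, chosen for illustration.
    """
    features = Counter()
    for url, text in pages:
        # Textual features: lowercase term counts pooled across all pages.
        for term in re.findall(r"[a-z]+", text.lower()):
            features["term=" + term] += 1
        parsed = urlparse(url)
        # Structural features (assumed): top-level domain and URL path depth.
        host = parsed.hostname or ""
        features["tld=" + host.rsplit(".", 1)[-1]] += 1
        depth = len([part for part in parsed.path.split("/") if part])
        features["depth=" + str(depth)] += 1
    return features


def micro_f_measure(true_sets, predicted_sets):
    """Micro-averaged F-measure over multi-label predictions.

    Each element is the set of labels for one site. Micro-averaging is an
    assumption; the abstract does not say which averaging was used.
    """
    tp = fp = fn = 0
    for truth, predicted in zip(true_sets, predicted_sets):
        tp += len(truth & predicted)   # labels correctly assigned
        fp += len(predicted - truth)   # labels assigned but wrong
        fn += len(truth - predicted)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pages = [
        ("http://example.org/", "Donate today"),
        ("http://example.org/about/team", "Our team"),
    ]
    print(site_features(pages))
    truth = [{"public"}, {"non-profit", "private"}]
    predicted = [{"public"}, {"non-profit"}]
    print(micro_f_measure(truth, predicted))
```

Representing labels as sets lets a site carry more than one category at once, which is what distinguishes the multi-label setting from ordinary single-label classification.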