Classifying Web Pages by Genre: An n-Gram Approach
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
The research reported in this paper is part of a larger project on classifying Web pages by genre, a potentially powerful tool for filtering the results of online searches. We describe two sets of experiments on automatic genre classification in which each Web page is represented by a profile composed of fixed-length byte n-grams. The first set examines the effect of three feature selection measures on classification accuracy. The second set compares the classification accuracy of three methods, each using n-gram representations of the Web pages: a distance function approach, the k-nearest neighbors method, and the support vector machine approach. We also examine a range of n-gram lengths and Web page profile sizes to determine which combinations give the best classification accuracy. Each set of experiments is run on two well-known data sets, 7-Genre and KI-04, for which published results are available.
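To make the profile representation concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a profile is the set of the most frequent byte n-grams of a page together with their normalized frequencies, and it uses a relative-difference dissimilarity as a stand-in distance function, since the abstract does not say which distance function is used. The names `byte_ngram_profile`, `profile_dissimilarity`, and `nearest_genre` are hypothetical.

```python
from collections import Counter


def byte_ngram_profile(data: bytes, n: int, profile_size: int) -> dict:
    """Map the `profile_size` most frequent byte n-grams to normalized frequencies."""
    if n < 1 or profile_size < 1:
        raise ValueError("n and profile_size must be positive")
    # Slide a window of n bytes over the page content and count each n-gram.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Sort by descending count, then by byte value, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: count / total for gram, count in ranked[:profile_size]}


def profile_dissimilarity(p: dict, q: dict) -> float:
    """Assumed distance: sum of squared relative frequency differences.

    An n-gram missing from one profile is treated as having frequency 0.
    """
    total = 0.0
    for gram in p.keys() | q.keys():
        a = p.get(gram, 0.0)
        b = q.get(gram, 0.0)
        # a + b > 0 because the n-gram occurs in at least one profile.
        total += ((a - b) / ((a + b) / 2)) ** 2
    return total


def nearest_genre(page_profile: dict, genre_profiles: dict) -> str:
    """Assign the genre whose (hypothetical) genre profile is least dissimilar."""
    return min(
        genre_profiles,
        key=lambda genre: profile_dissimilarity(page_profile, genre_profiles[genre]),
    )
```

Under these assumptions, a page would be classified by building its profile with the same `n` and `profile_size` used for the genre profiles and calling `nearest_genre`. Varying those two parameters corresponds to the n-gram lengths and profile sizes that the experiments explore.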