Webpages are mainly distinguished by their topic (e.g., politics, sports) and genre (e.g., blogs, homepages, e-shops). Automatic detection of webpage genre could considerably enhance the ability of modern search engines to address the user's information need. In this paper, we present an approach to webpage genre detection based on fully automated extraction of the feature set that represents the style of webpages. The features we propose (character n-grams of variable length and HTML tags) are language-independent and easily extracted, and they can be adapted to the properties of the still-evolving web genres and the noisy environment of the web. Experiments on two publicly available corpora show that the proposed approach outperforms previously reported results. We also show that character n-grams are better features than words as the dimensionality increases, and that the binary representation is more effective than the term-frequency representation for both feature types. Moreover, we perform a series of cross-check experiments (e.g., training on one genre palette and testing on a different one, as well as using the features extracted from one corpus to discriminate the genres of the other corpus) to illustrate the robustness of our approach and its ability to capture the general stylistic properties of genre categories even when the feature set is not optimized for the given corpus.
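To make the two feature types and the two representations concrete, the following is a minimal Python sketch, not the authors' implementation. It counts character n-grams of variable length over a page's text, counts HTML tag names, and maps the counts onto a fixed vocabulary as either a binary or a term-frequency vector. The n-gram range (3 to 5), the `ng:` and `tag:` prefixes, and the names `char_ngrams`, `extract_features` and `to_vector` are all assumptions made for illustration; the abstract does not specify how n-gram lengths or the final feature set are selected.

```python
from collections import Counter
from html.parser import HTMLParser


class _TagAndTextCollector(HTMLParser):
    """Collects opening-tag names and text content from an HTML string."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Note: this also receives <script>/<style> content; a real
        # pipeline would probably filter it (assumption, not from the paper).
        self.text_parts.append(data)


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams with n_min <= n <= n_max (assumed range)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def extract_features(html, n_min=3, n_max=5):
    """Return a Counter of prefixed n-gram and HTML-tag frequencies."""
    parser = _TagAndTextCollector()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so layout noise does not create n-grams.
    text = " ".join(" ".join(parser.text_parts).split())
    feats = Counter(
        {"ng:" + g: c for g, c in char_ngrams(text, n_min, n_max).items()}
    )
    feats.update("tag:" + t for t in parser.tags)
    return feats


def to_vector(feats, vocabulary, binary=True):
    """Map feature counts onto a fixed vocabulary.

    binary=True  -> 1 if the feature occurs at all, else 0
    binary=False -> raw term frequency
    """
    if binary:
        return [1 if feats[f] > 0 else 0 for f in vocabulary]
    return [feats[f] for f in vocabulary]


if __name__ == "__main__":
    page = "<html><body><h1>My blog</h1><p>Welcome to my blog</p></body></html>"
    feats = extract_features(page)
    vocab = ["ng:blo", "ng:blog", "tag:p", "tag:table"]  # hypothetical vocabulary
    print(to_vector(feats, vocab, binary=True))
    print(to_vector(feats, vocab, binary=False))
```

The resulting vectors could be passed to any standard classifier. Neither feature type requires a tokenizer or language-specific resources, which is in line with the language independence claimed above.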