Learning to recognize webpage genres

Authors:
Ioannis Kanaris;Efstathios Stamatatos
Affiliations:
Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Samos 83200, Greece;Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Samos 83200, Greece
Venue:
Information Processing and Management: an International Journal
Year:
2009

Citing 21
Cited 4

Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
Text genre classification with genre-revealing and subject-revealing features

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Text Categorization with Support Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
The Evolution of Cybergenres

HICSS '98 Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences - Volume 2
Theoretical and Empirical Analysis of ReliefF and RReliefF

Machine Learning
An extensive empirical study of feature selection metrics for text classification

The Journal of Machine Learning Research
Automatic detection of text genre

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Text genre detection using common word frequencies

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Automatic Identification of Home Pages on the Web

HICSS '05 Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05) - Track 4 - Volume 04
Multiple sets of features for automatic genre classification of web documents

Information Processing and Management: an International Journal
Effects of web document evolution on genre classification

Proceedings of the 14th ACM international conference on Information and knowledge management
The form is the substance: classification of genres in text

HLTKM '01 Proceedings of the workshop on Human Language Technology and Knowledge Management - Volume 2001
Learning to classify documents according to genre: Special Topic Section on Computational Analysis of Style

Journal of the American Society for Information Science and Technology
Classifying XML Documents by Using Genre Features

DEXA '07 Proceedings of the 18th International Conference on Database and Expert Systems Applications
Using Visual Features for Fine-Grained Genre Classification of Web Pages

HICSS '08 Proceedings of the 41st Annual Hawaii International Conference on System Sciences
Examining Variations of Prominent Features in Genre Classification

HICSS '08 Proceedings of the 41st Annual Hawaii International Conference on System Sciences
An Examination of Genre Attributes for Web Page Classification

HICSS '08 Proceedings of the 41st Annual Hawaii International Conference on System Sciences
Zero, single, or multi? Genre of web pages through the users' perspective

Information Processing and Management: an International Journal
An N-Gram Based Approach to Automatically Identifying Web Page Genre

HICSS '09 Proceedings of the 42nd Hawaii International Conference on System Sciences
Combinatorial Markov random fields and their applications to information organization

Combinatorial Markov random fields and their applications to information organization
N-Gram feature selection for authorship identification

AIMSA'06 Proceedings of the 12th international conference on Artificial Intelligence: Methodology, Systems, and Applications

Fine-grained genre classification using structural learning algorithms

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
On identifying academic homepages for digital libraries

Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Evaluating large-scale distributed vertical search

Proceedings of the 9th workshop on Large-scale and distributed informational retrieval
Open-Set classification for automated genre identification

ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval

Abstract

Webpages are mainly distinguished by their topic (e.g., politics, sports) and genre (e.g., blogs, homepages, e-shops). Automatic detection of webpage genre could considerably improve the ability of search engines to match results to the user's information need. In this paper, we present an approach to webpage genre detection based on fully automated extraction of the feature set that represents the style of webpages. The proposed features (variable-length character n-grams and HTML tags) are language-independent and easily extracted, and they can be adapted to the properties of still-evolving web genres and the noisy environment of the web. Experiments on two publicly available corpora show that the proposed approach outperforms previously reported results. They also show that character n-grams are better features than words as dimensionality increases, and that the binary representation is more effective than the term-frequency representation for both feature types. Moreover, we perform a series of cross-check experiments (e.g., training on one genre palette and testing on a different one, and using the features extracted from one corpus to discriminate the genres of the other) to illustrate the robustness of our approach and its ability to capture the general stylistic properties of genre categories even when the feature set is not optimized for the given corpus.
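
To make the two feature types and the two representations named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It counts every character n-gram within a length range and every opening HTML tag name, then returns either term-frequency or binary (presence/absence) values. The function names, the 3 to 5 n-gram range, the regex-based tag handling, and the "ng:" / "tag:" key prefixes are all assumptions made for illustration; the feature selection step the paper relies on is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical helper pattern: captures the name of each opening HTML tag.
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")


def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def html_tags(html):
    """Count opening tag names, lowercased (closing tags are ignored)."""
    return Counter(name.lower() for name in TAG_RE.findall(html))


def strip_tags(html):
    """Crudely remove markup and collapse whitespace (assumed preprocessing)."""
    return " ".join(re.sub(r"<[^>]*>", " ", html).split())


def page_features(html, n_min=3, n_max=5, binary=True):
    """Build one feature dict per page from character n-grams and HTML tags.

    binary=True  -> every observed feature has value 1 (presence only)
    binary=False -> every observed feature keeps its raw count
    """
    counts = Counter()
    for gram, c in char_ngrams(strip_tags(html), n_min, n_max).items():
        counts["ng:" + gram] = c
    for tag, c in html_tags(html).items():
        counts["tag:" + tag] = c
    if binary:
        return {feature: 1 for feature in counts}
    return dict(counts)


if __name__ == "__main__":
    page = "<html><body><p>blog</p><p>blog</p></body></html>"
    print(page_features(page, binary=False))
    print(page_features(page, binary=True))
```

Both representations share the same feature keys and differ only in their values, so the resulting dictionaries can be passed unchanged to any classifier that accepts sparse feature maps.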