Text genre detection using common word frequencies

Authors:
E. Stamatatos; N. Fakotakis; G. Kokkinakis
Affiliations:
University of Patras, Patras, Greece; University of Patras, Patras, Greece; University of Patras, Patras, Greece
Venue:
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Year:
2000

Abstract

In this paper we present a fast and simple method for detecting text genre. It follows an approach originally proposed in authorship attribution studies, which uses the frequencies of occurrence of the most frequent words in a training corpus as style markers (Burrows, 1992). In contrast to that approach, we use the frequencies of occurrence of the most frequent words of the entire written language. Using part of the Wall Street Journal corpus as a testing ground, we show that the most frequent words of the British National Corpus, taken to represent the most frequent words of written English, are more reliable discriminators of text genre than the most frequent words of the training corpus. Moreover, the frequencies of occurrence of the most common punctuation marks play an important role, both for accurate text categorization and when dealing with training data of limited size.
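
As an illustration of the kind of style markers described in the abstract, the following Python sketch turns a text into a vector of relative frequencies for a fixed list of common words and punctuation marks. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `COMMON_WORDS`, `PUNCTUATION`, and `style_vector` are hypothetical, the ten-word list is a small stand-in for a list that would be drawn from the British National Corpus, and the tokenization and normalization choices are assumptions. The classifier that would consume these vectors is not shown.

```python
import re
from collections import Counter

# Hypothetical stand-in list. In the approach described above, the list
# would hold the most frequent words of a reference corpus such as the
# British National Corpus, not words chosen from the training texts.
COMMON_WORDS = ["the", "of", "and", "a", "in", "to", "is", "was", "it", "for"]

# Hypothetical selection of common punctuation marks.
PUNCTUATION = [".", ",", ":", ";", "?", "!", "(", "-"]


def style_vector(text, words=COMMON_WORDS, marks=PUNCTUATION):
    """Return relative frequencies of the given words and marks in text.

    Assumption: tokens are runs of ASCII letters or single non-space,
    non-letter characters, so "don't" is split into three tokens.
    Each frequency is the token count divided by the total token count.
    """
    tokens = re.findall(r"[a-z]+|[^\sa-z]", text.lower())
    total = len(tokens)
    if total == 0:
        return [0.0] * (len(words) + len(marks))
    counts = Counter(tokens)
    return [counts[item] / total for item in list(words) + list(marks)]


if __name__ == "__main__":
    sample = "The cat sat on the mat, and the dog sat in the sun."
    for item, freq in zip(COMMON_WORDS + PUNCTUATION, style_vector(sample)):
        print(f"{item!r}: {freq:.3f}")
```

Because the feature list is fixed in advance, every text maps to a vector of the same length regardless of the training data, which is the property that distinguishes this setup from taking the most frequent words of the training corpus itself.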