In this paper we present a fast and simple method for detecting text genre. It follows an approach originally proposed in authorship attribution studies, which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus (Burrows, 1992). In contrast to that approach, we use the frequencies of occurrence of the most frequent words of the entire written language. Using part of the Wall Street Journal corpus as a testing ground, we show that the most frequent words of the British National Corpus, taken to represent the most frequent words of written English, are more reliable discriminators of text genre than the most frequent words of the training corpus. Moreover, the frequencies of occurrence of the most common punctuation marks play an important role both in achieving accurate text categorization and in coping with training data of limited size.
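
The style markers described above can be illustrated with a minimal sketch. It is not the authors' implementation: the short word list stands in for a list of the most frequent words of a reference corpus such as the British National Corpus, the punctuation set is a guess, and the names `COMMON_WORDS`, `PUNCTUATION`, and `style_vector` are hypothetical. Normalizing every count by the number of word tokens is likewise an assumption; the abstract does not specify the normalization or the classifier that consumes the vectors.

```python
import re
from collections import Counter

# Hypothetical placeholder list. A real experiment would load the top-N
# words of a reference corpus (e.g. the BNC) instead of this short sample.
COMMON_WORDS = ["the", "of", "and", "a", "in", "to", "is", "was", "it", "for"]

# Assumed set of common punctuation marks; the source does not list them.
PUNCTUATION = [".", ",", ":", ";", "?", "!", "(", "-"]


def style_vector(text, words=COMMON_WORDS, marks=PUNCTUATION):
    """Return relative frequencies of common words and punctuation marks.

    Every count is divided by the number of word tokens in the text
    (an assumed normalization), so texts of different length are comparable.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        # No word tokens: return an all-zero vector of the right length.
        return [0.0] * (len(words) + len(marks))

    word_counts = Counter(tokens)
    char_counts = Counter(text)  # per-character counts, used for punctuation

    word_features = [word_counts[w] / total for w in words]
    mark_features = [char_counts[m] / total for m in marks]
    return word_features + mark_features


if __name__ == "__main__":
    sample = "The cat sat. The dog, it ran."
    print(style_vector(sample))
```

Each text becomes a fixed-length vector whose dimensions do not depend on the training corpus, which is the property the abstract contrasts with word lists drawn from the training data. The vectors could then be passed to any standard classifier.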