Automatic text categorization in terms of genre and author

Authors: Efstathios Stamatatos; George Kokkinakis; Nikos Fakotakis
Affiliation: University of Patras (all authors)
Venue: Computational Linguistics
Year: 2000

Abstract

The two main factors that characterize a text are its content and its style, and both can be used as a means of categorization. In this paper we present an approach to text categorization in terms of genre and author for Modern Greek. In contrast to previous stylometric approaches, we attempt to take full advantage of existing natural language processing (NLP) tools. To this end, we propose a set of style markers including analysis-level measures that represent the way in which the input text has been analyzed and capture useful stylistic information without additional cost. We present a set of small-scale but reasonable experiments in text genre detection, author identification, and author verification tasks and show that the proposed method performs better than the most popular distributional lexical measures, i.e., functions of vocabulary richness and frequencies of occurrence of the most frequent words. All the presented experiments are based on unrestricted text downloaded from the World Wide Web without any manual text preprocessing or text sampling. Various performance issues regarding the training set size and the significance of the proposed style markers are discussed. Our system can be used in any application that requires fast and easily adaptable text categorization in terms of stylistically homogeneous categories. Moreover, the procedure of defining analysis-level markers can be followed in order to extract useful stylistic information using existing text processing tools.
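The baseline the abstract compares against, distributional lexical measures, can be illustrated with a minimal sketch. It is not the authors' implementation, and it does not cover the proposed analysis-level markers, which depend on an NLP tool's output. The abstract does not say which vocabulary richness function is used, so the type-token ratio here is an assumption. All function names and the regex tokenizer are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (assumed tokenizer).

    In Python 3, \\w is Unicode-aware for str patterns, so Greek letters
    are matched as well as Latin ones.
    """
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    """Vocabulary richness as distinct words divided by total words.

    This is one simple richness function chosen for illustration; the
    abstract does not specify which functions the paper evaluates.
    """
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def most_frequent_words(tokenized_docs, n):
    """Return the n most frequent words across a collection of token lists."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return [word for word, _ in counts.most_common(n)]


def frequency_vector(tokens, vocabulary):
    """Relative frequency in one text of each word in a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


def lexical_features(text, vocabulary):
    """Concatenate the richness score and the frequent-word frequencies."""
    tokens = tokenize(text)
    return [type_token_ratio(tokens)] + frequency_vector(tokens, vocabulary)
```

In use, the vocabulary would be computed once from the training texts with `most_frequent_words`, and `lexical_features` would then map every text to a fixed-length vector that any standard classifier can consume.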