Stylometric analysis of scientific articles

Authors:
Shane Bergsma; Matt Post; David Yarowsky
Affiliations:
Johns Hopkins University, Baltimore, MD; Johns Hopkins University, Baltimore, MD; Johns Hopkins University, Baltimore, MD
Venue:
NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Year:
2012

Abstract

We present an approach to automatically recovering hidden attributes of scientific articles, such as whether the author is a native English speaker, whether the author is male or female, and whether the paper was published in conference or workshop proceedings. We train classifiers to predict these attributes in computational linguistics papers. The classifiers perform well in this challenging domain, identifying non-native writing with 95% accuracy (over a baseline of 67%). We show the benefits of using syntactic features in stylometry: syntax yields significant improvements over bag-of-words models on all three tasks, achieving 10% to 25% relative error reduction. We give a detailed analysis of which words and syntactic features best predict a particular attribute, and we show a strong correlation between our predictions and a paper's number of citations.
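
The "relative error reduction" figure quoted in the abstract can be read as the fraction of a reference system's errors that the improved system removes. The sketch below is a minimal illustration of that arithmetic, not the authors' evaluation code. The function name `relative_error_reduction` is an assumption. The 95% and 67% figures come from the abstract; the bag-of-words and syntax accuracies in the second call are hypothetical placeholders, since the abstract does not report them.

```python
def relative_error_reduction(reference_acc: float, system_acc: float) -> float:
    """Fraction of the reference system's errors that the new system removes.

    Both arguments are accuracies in [0, 1]. This helper is hypothetical and
    only illustrates the arithmetic behind a relative error reduction figure.
    """
    reference_err = 1.0 - reference_acc
    system_err = 1.0 - system_acc
    if reference_err <= 0.0:
        raise ValueError("reference system makes no errors, so nothing to reduce")
    return (reference_err - system_err) / reference_err


# Figures stated in the abstract: 95% accuracy against a 67% baseline
# for identifying non-native writing.
rer_vs_baseline = relative_error_reduction(0.67, 0.95)

# Hypothetical accuracies (NOT from the paper): a bag-of-words model at 90%
# and a model with added syntactic features at 92%.
rer_hypothetical = relative_error_reduction(0.90, 0.92)

print(f"vs. baseline:          {rer_vs_baseline:.1%}")
print(f"hypothetical BoW->syn: {rer_hypothetical:.1%}")
```

Note that the abstract's 10% to 25% range compares syntactic features against bag-of-words models, not against the 67% baseline, so the first call above is a different comparison from the one the abstract reports.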