Exploiting parse structures for native language identification

Authors:
Sze-Meng Jojo Wong;Mark Dras
Affiliations:
Macquarie University, Sydney, Australia;Macquarie University, Sydney, Australia
Venue:
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2011





Abstract

Attempts to profile authors according to characteristics extracted from their texts, including native language, have drawn attention in recent years, through various machine learning approaches that use mostly lexical features. Drawing on the idea of contrastive analysis, which postulates that syntactic errors in a text are to some extent influenced by the native language of its author, this paper explores the usefulness of syntactic features for native language identification. We take two types of parse substructure as features (horizontal slices of trees, and the more general feature schemas from discriminative parse reranking) and show that syntactic features of this kind yield a classification accuracy of around 80% across seven native languages, an error reduction of more than 30%.
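To make the idea of "horizontal slices of trees" more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that a horizontal slice can be read as a production rule (a parent label together with the labels of its immediate children) and that parses arrive as Penn-Treebank-style bracketed strings. The function names `parse_bracketed` and `production_counts`, the `include_lexical` flag, and the example sentence are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse a bracketed parse string into nested (label, children) tuples.

    Children are either nested tuples (subtrees) or plain strings (words).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word at the leaf level
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def production_counts(tree, include_lexical=False):
    """Count production rules (assumed here to stand in for horizontal slices).

    By default, rules that rewrite a part-of-speech tag as a word are skipped,
    so that the features stay syntactic rather than lexical.
    """
    counts = Counter()

    def visit(node):
        label, children = node
        is_preterminal = all(isinstance(child, str) for child in children)
        if is_preterminal and not include_lexical:
            return
        rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
        counts[f"{label} -> {rhs}"] += 1
        for child in children:
            if not isinstance(child, str):
                visit(child)

    visit(tree)
    return counts


# Hypothetical example parse; in practice the trees would come from a parser.
tree = parse_bracketed(
    "(S (NP (DT the) (NN student)) (VP (VBZ writes) (NP (NNS essays))))"
)
features = production_counts(tree)
for rule, count in sorted(features.items()):
    print(rule, count)
```

Counts of this kind, collected over every sentence in a document, could then be passed as a feature vector to any standard classifier. The feature schemas from discriminative parse reranking mentioned in the abstract are more general than this and are not covered by the sketch.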