Syntactic N-grams as machine learning features for natural language processing

  • Authors:
  • Grigori Sidorov, Francisco Velasquez, Efstathios Stamatatos, Alexander Gelbukh, Liliana Chanona-Hernández

  • Affiliations:
  • Grigori Sidorov, Francisco Velasquez, Alexander Gelbukh: Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico City, Mexico
  • Efstathios Stamatatos: University of the Aegean, Greece
  • Liliana Chanona-Hernández: ESIME, Instituto Politécnico Nacional (IPN), Mexico City, Mexico

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2014


Abstract

In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. In the case of sn-grams, neighbors are determined by following syntactic relations in syntactic trees rather than by taking words in their surface order; that is, sn-grams are constructed by following paths in syntactic trees. In this manner, sn-grams allow syntactic knowledge to be brought into machine learning methods, although prior parsing is required for their construction. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As baselines we used traditional n-grams of words, part-of-speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the tree classifier J48. Sn-grams give better results with the SVM classifier.
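To illustrate the core idea, the following is a minimal sketch of extracting word sn-grams by following head-to-dependent paths in a dependency tree. The tree representation (a head-to-children mapping) and the example sentence are illustrative assumptions, not the paper's implementation; in practice the tree would come from a syntactic parser.

```python
def sn_grams(tree, root, n):
    """Collect all length-n paths that follow syntactic (head -> dependent) edges.

    tree: dict mapping each head word to a list of its dependents
          (an assumed simplified representation of a parser's output).
    """
    grams = []

    def paths_from(node, path):
        # extend the current downward path; record it once it reaches length n
        path = path + [node]
        if len(path) == n:
            grams.append(tuple(path))
            return
        for child in tree.get(node, []):
            paths_from(child, path)

    def all_nodes(node):
        # enumerate every node so paths may start anywhere in the tree
        yield node
        for child in tree.get(node, []):
            yield from all_nodes(child)

    for node in all_nodes(root):
        paths_from(node, [])
    return grams


# Hypothetical dependency tree for "the dog barks loudly":
# "barks" heads "dog" and "loudly"; "dog" heads "the".
tree = {"barks": ["dog", "loudly"], "dog": ["the"]}
print(sn_grams(tree, "barks", 2))
# → [('barks', 'dog'), ('barks', 'loudly'), ('dog', 'the')]
```

Note that the bigram ("barks", "dog") follows a syntactic relation even though the two words are not adjacent in the surface text, which is exactly what distinguishes sn-grams from traditional n-grams.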