Better Arabic parsing: baselines, evaluations, and analysis

Authors:
Spence Green;Christopher D. Manning
Affiliations:
Stanford University;Stanford University
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Year:
2010

Citing 25
Cited 15

Building a large annotated corpus of English: the penn treebank

Computational Linguistics - Special issue on using large corpora: II
A test of the leaf-ancestor metric for parse accuracy

Natural Language Engineering
A statistical parser for Czech

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Feature-rich part-of-speech tagging with a cyclic dependency network

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Probabilistic parsing for German using sister-head dependencies

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Accurate unlexicalized parsing

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Is it harder to parse Chinese, or the Chinese Treebank?

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Intricacies of Collins' Parsing Model

Computational Linguistics
Head-Driven Statistical Models for Natural Language Parsing

Computational Linguistics
Lexicalization in crosslinguistic probabilistic parsing: the case of French

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Learning accurate, compact, and interpretable tree annotation

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Maximum entropy based restoration of Arabic diacritics

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Morphology and reranking for the statistical parsing of Spanish

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Construct state modification in the Arabic Treebank

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Integrated morphological and syntactic disambiguation for Modern Hebrew

COLING ACL '06 Proceedings of the 21st International Conference on computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Relational-realizational parsing

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Arabic preprocessing schemes for statistical machine translation

NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
Arabic diacritization through full morphological tagging

NAACL-Short '07 Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers
Using a maximum entropy model to build segmentation lattices for MT

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Developing an Arabic treebank: methods, guidelines, procedures, and tools

Semitic '04 Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages
Evaluating and integrating treebank parsers on a biomedical corpus

Software '05 Proceedings of the Workshop on Software
CATiB: the Columbia Arabic Treebank

ACLShort '09 Proceedings of the ACL-IJCNLP 2009 Conference Short Papers
Self-training PCFG grammars with latent annotations across languages

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Coarse-to-fine natural language processing

Coarse-to-fine natural language processing

Exploiting Separation of Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging

ACM Transactions on Asian Language Information Processing (TALIP)
Improving Arabic dependency parsing with form-based and functional morphological features

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Language-independent parsing with empty elements

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Using derivation trees for treebank error detection

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Joint Hebrew segmentation and parsing using a PCFG-LA lattice parser

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Improved Arabic-to-English statistical machine translation by reordering post-verbal subjects for word alignment

Machine Translation
A class-based agreement model for generating accurately inflected translations

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Joint evaluation of morphological segmentation and syntactic parsing

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
Joint Chinese word segmentation, POS tagging and parsing

EMNLP-CoNLL '12 Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
A machine-learning framework for hybrid machine translation

KI'12 Proceedings of the 35th Annual German conference on Advances in Artificial Intelligence
Part of speech tagging for arabic

Natural Language Engineering
Word segmentation, unknown-word resolution, and morphological agreement in a hebrew parsing system

Computational Linguistics
Dependency parsing of modern standard arabic with lexical and inflectional features

Computational Linguistics
Parsing models for identifying multiword expressions

Computational Linguistics
A recommendation system for Twitter users in the same neighborhood

Proceedings of the 16th Communications & Networking Symposium

Quantified Score

Hi-index	0.01

Visualization

Abstract

In this paper, we offer broad insight into the underperformance of Arabic constituency parsing by analyzing the interplay of linguistic phenomena, annotation choices, and model design. First, we identify sources of syntactic ambiguity understudied in the existing parsing literature. Second, we show that although the Penn Arabic Treebank is similar to other tree-banks in gross statistical terms, annotation consistency remains problematic. Third, we develop a human interpretable grammar that is competitive with a latent variable PCFG. Fourth, we show how to build better models for three different parsers. Finally, we show that in application settings, the absence of gold segmentation lowers parsing performance by 2--5% F1.