Automatic tagging of Arabic text: from raw text to base phrase chunks

Authors:
Mona Diab;Kadri Hacioglu;Daniel Jurafsky
Affiliations:
Stanford University;University of Colorado, Boulder;Stanford University
Venue:
HLT-NAACL-Short '04 Proceedings of HLT-NAACL 2004: Short Papers
Year:
2004



Abstract

To date, there are no fully automated systems addressing the community's need for fundamental language processing tools for Arabic text. In this paper, we present a Support Vector Machine (SVM) based approach to automatically tokenize (segmenting off clitics), part-of-speech (POS) tag, and annotate base phrases (BPs) in Arabic text. We adapt highly accurate tools that have been developed for English text and apply them to Arabic text. Using standard evaluation metrics, we report that the SVM-TOK tokenizer achieves an F-score (β = 1) of 99.12, the SVM-POS tagger achieves an accuracy of 95.49%, and the SVM-BP chunker yields an F-score (β = 1) of 92.08.
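
The F-score with β = 1 reported above is the harmonic mean of precision and recall; in general, F_β = (1 + β²)·P·R / (β²·P + R). The sketch below shows how such a score could be computed for a chunker by comparing predicted chunks against gold chunks. It is only an illustration: the function names, the `(start, end, label)` chunk representation, and the toy data are assumptions made here, not the paper's evaluation code or its data.

```python
from typing import Set, Tuple

# Hypothetical chunk representation: (start_token, end_token, label).
Chunk = Tuple[int, int, str]


def precision_recall(gold: Set[Chunk], predicted: Set[Chunk]) -> Tuple[float, float]:
    """Precision and recall over exactly matching chunks."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); 0.0 if undefined."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Toy data (invented for illustration only).
gold = {(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP"), (5, 6, "PP")}
predicted = {(0, 2, "NP"), (2, 3, "VP"), (3, 4, "NP")}

p, r = precision_recall(gold, predicted)
score = f_beta(p, r)  # beta = 1 weights precision and recall equally
print(f"P={p:.3f} R={r:.3f} F1={score:.3f}")
```

In this toy case two of the three predicted chunks match the four gold chunks exactly, so precision is 2/3, recall is 1/2, and the β = 1 score is 4/7. Scores such as 99.12 and 92.08 in the abstract are the same quantity expressed as a percentage.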