Language Model Based Arabic Word Segmentation

  • Authors:
  • Young-Suk Lee; Kishore Papineni; Salim Roukos; Ossama Emam; Hany Hassan

  • Affiliations:
  • IBM T. J. Watson Research Center, Yorktown Heights, NY (Lee, Papineni, Roukos); IBM Cairo Technology Development Center, El-Ahram, Giza, Egypt (Emam, Hassan)

  • Venue:
  • ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
  • Year:
  • 2003

Abstract

We approximate Arabic's rich morphology with a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded with a small manually segmented Arabic corpus, which is used to bootstrap an unsupervised algorithm that builds the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155-million-word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact-match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance, and that the algorithm can be applied to many highly inflected languages, provided that a small manually segmented corpus can be created for the language of interest.
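
As a rough illustration of the decoding step described in the abstract, the short Python sketch below enumerates candidate prefix*-stem-suffix* segmentations of a word and picks the one with the highest trigram language-model score. It is not the authors' implementation: the morpheme inventories and trigram probabilities are invented toy values, a real system would estimate them from the segmented training corpus, and for brevity the sketch allows at most one prefix and one suffix rather than zero or more of each.

    # Minimal sketch of trigram-LM morpheme decoding; not the authors' code.
    # All inventories, trigram scores, and the backoff penalty are hypothetical.
    import math

    PREFIXES = {"Al", "w", "b"}        # toy prefix inventory (assumption)
    SUFFIXES = {"p", "At", "h"}        # toy suffix inventory (assumption)
    STEMS = {"ktAb", "rjl", "mdrs"}    # toy stem vocabulary (assumption)

    TRIGRAM_LOGPROB = {                # toy trigram log-probabilities (assumption)
        ("<s>", "<s>", "Al"): -0.5,
        ("<s>", "Al", "ktAb"): -1.0,
        ("Al", "ktAb", "</s>"): -0.7,
    }
    UNSEEN_PENALTY = -5.0              # crude stand-in for smoothing/backoff


    def trigram_logprob(w1, w2, w3):
        """Score one morpheme trigram, falling back to a flat penalty."""
        return TRIGRAM_LOGPROB.get((w1, w2, w3), UNSEEN_PENALTY)


    def candidate_segmentations(word):
        """Enumerate segmentations of `word` matching prefix*-stem-suffix*.

        For brevity this sketch allows at most one prefix and one suffix;
        the paper's pattern permits zero or more of each.
        """
        for i in range(len(word) + 1):
            for j in range(i + 1, len(word) + 1):
                prefix, stem, suffix = word[:i], word[i:j], word[j:]
                if prefix and prefix not in PREFIXES:
                    continue
                if suffix and suffix not in SUFFIXES:
                    continue
                if stem not in STEMS:
                    continue
                yield [m for m in (prefix, stem, suffix) if m]


    def best_segmentation(word):
        """Return the candidate morpheme sequence with the highest LM score."""
        best, best_score = [word], -math.inf   # fall back to the unsegmented word
        for morphemes in candidate_segmentations(word):
            seq = ["<s>", "<s>"] + morphemes + ["</s>"]
            score = sum(trigram_logprob(*seq[k:k + 3]) for k in range(len(seq) - 2))
            if score > best_score:
                best, best_score = morphemes, score
        return best


    if __name__ == "__main__":
        print(best_segmentation("AlktAb"))   # expected: ['Al', 'ktAb']

The exhaustive enumeration above is only workable for short words with small inventories; a practical segmenter for a 155-million-word corpus would decode with dynamic programming (e.g., Viterbi search over morpheme lattices) rather than listing every candidate.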