We approximate Arabic's rich morphology with a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded with a small manually segmented Arabic corpus, which bootstraps an unsupervised algorithm that builds the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from the manually segmented corpus of about 110,000 words. To improve segmentation accuracy, we use an unsupervised algorithm to automatically acquire new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance, and that the algorithm can be applied to many highly inflected languages, provided that a small manually segmented corpus of the language of interest can be created.
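The core decoding idea, choosing the most probable prefix*-stem-suffix* morpheme sequence under a trigram language model, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration rather than the system described above: the Latin-transliterated toy affix inventories (`PREFIXES`, `SUFFIXES`), the toy stem list and training data, the per-word trigram context, and the add-one smoothing are all hypothetical stand-ins.

```python
import math
from collections import Counter

# Hypothetical toy affix inventories (transliterated), not the real ones.
PREFIXES = {"w", "al", "b"}
SUFFIXES = {"h", "at", "hm"}


def affix_splits(text, inventory):
    """All ways to write `text` as zero or more affixes from `inventory`."""
    if not text:
        return [[]]
    out = []
    for i in range(1, len(text) + 1):
        if text[:i] in inventory:
            out += [[text[:i]] + rest for rest in affix_splits(text[i:], inventory)]
    return out


def candidates(word, stems):
    """Enumerate every prefix*-stem-suffix* analysis of `word`."""
    results = []
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            if word[i:j] in stems:
                for pre in affix_splits(word[:i], PREFIXES):
                    for suf in affix_splits(word[j:], SUFFIXES):
                        results.append(pre + [word[i:j]] + suf)
    return results


class TrigramLM:
    """Morpheme trigram model with add-one smoothing (an assumed choice)."""

    def __init__(self, segmented_words):
        self.tri, self.ctx, self.vocab = Counter(), Counter(), set()
        for morphemes in segmented_words:
            seq = ["<s>", "<s>"] + list(morphemes) + ["</s>"]
            self.vocab.update(seq)
            for k in range(2, len(seq)):
                self.tri[tuple(seq[k - 2:k + 1])] += 1
                self.ctx[tuple(seq[k - 2:k])] += 1

    def logprob(self, morphemes):
        seq = ["<s>", "<s>"] + list(morphemes) + ["</s>"]
        v = len(self.vocab) + 1  # +1 leaves mass for unseen morphemes
        return sum(
            math.log((self.tri[tuple(seq[k - 2:k + 1])] + 1)
                     / (self.ctx[tuple(seq[k - 2:k])] + v))
            for k in range(2, len(seq))
        )


def segment(word, lm, stems):
    """Return the highest-scoring analysis, or the word itself if none exists."""
    cands = candidates(word, stems)
    return max(cands, key=lm.logprob) if cands else [word]


# Toy stand-in for the small manually segmented seed corpus.
train = [["w", "al", "ktab"], ["al", "ktab"], ["ktab", "h"], ["w", "ktab", "hm"],
         ["b", "al", "qlm"], ["qlm"], ["wld"], ["al", "wld"]]
stems = {"ktab", "qlm", "wld", "ld"}
lm = TrigramLM(train)

print(segment("walktab", lm, stems))
print(segment("wld", lm, stems))
```

In this toy setup, `wld` has two competing analyses (`wld` as a bare stem, or `w` + `ld`), and the language model prefers the bare stem because that sequence was observed in the seed data. The stem-acquisition step could be imitated by adding newly found stems to `stems`, extending `train` with the segmentations they produce, and rebuilding `TrigramLM`, though the actual acquisition criteria are not specified in the text above.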