Arabic Natural Language Processing
The written form of Arabic, Modern Standard Arabic (MSA), differs substantially from the spoken dialects of Arabic, which are the true "native" languages that Arabic speakers use in daily life. However, because MSA dominates written communication, almost all Arabic datasets consist predominantly of MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and the dialect itself) of each sentence in the dataset. So far, we have labeled 108K sentences, 41% of which were marked as having dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.