Language classification is a preliminary step for most natural-language processing tasks. The large number of multilingual documents poses a problem for traditional language-classification schemes and requires segmenting each document into monolingual sections. Such language mixing is characteristic of classical and medieval Jewish literature, which frequently combines Hebrew, Aramaic, Judeo-Arabic and other Hebrew-script languages. We propose a method for classifying and segmenting multilingual texts in the Hebrew character set using bigram statistics. For texts such as the manuscripts found in the Cairo Genizah, we must also deal with a significant level of noise in the OCR-processed text.
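To make the idea of classifying and segmenting with bigram statistics concrete, here is a minimal sketch. It is not the method of the paper, whose details the abstract does not give. It assumes character bigrams, add-one smoothing, and a naive word-by-word labeling pass that merges neighbouring words with the same label. All function names, and the two toy training samples (the opening verse of Genesis in Hebrew and an Aramaic rendering of it), are illustrative assumptions.

```python
import math
from collections import Counter

# Toy training samples (assumption: real models would use large corpora).
SAMPLES = {
    "hebrew": "בראשית ברא אלהים את השמים ואת הארץ",
    "aramaic": "בקדמין ברא יי ית שמיא וית ארעא",
}

def bigrams(text):
    """Return the adjacent character pairs in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(samples):
    """Build one bigram Counter per language from {language: text}."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}

def log_score(text, counts, vocab_size):
    """Add-one smoothed log-likelihood of text under one bigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[bg] + 1) / (total + vocab_size))
        for bg in bigrams(text)
    )

def classify(text, models):
    """Return the language whose bigram model scores text highest."""
    vocab_size = len(set().union(*models.values()))
    return max(models, key=lambda lang: log_score(text, models[lang], vocab_size))

def segment(words, models):
    """Label each word, then merge adjacent same-language words into runs."""
    runs = []
    for word in words:
        lang = classify(word, models)
        if runs and runs[-1][0] == lang:
            runs[-1][1].append(word)
        else:
            runs.append((lang, [word]))
    return runs

models = train(SAMPLES)
print(segment(["השמים", "הארץ", "שמיא"], models))
```

Labeling single words is fragile, since many short words are shared between Hebrew-script languages. A practical system would likely score longer windows and smooth the label sequence, and would need to tolerate OCR errors such as those mentioned above.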