Language classification and segmentation of noisy documents in Hebrew scripts

Authors:
Nachum Dershowitz; Alex Zhicharevich
Affiliations:
Tel Aviv University, Ramat Aviv, Israel; Tel Aviv University, Ramat Aviv, Israel
Venue:
LaTeCH '12 Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Year:
2012

Abstract

Language classification is a preliminary step in most natural-language processing tasks. The large number of multilingual documents poses a problem for traditional language-classification schemes, which assume a single language per document, and requires segmenting each document into monolingual sections. This phenomenon is characteristic of classical and medieval Jewish literature, which frequently mixes Hebrew, Aramaic, Judeo-Arabic, and other Hebrew-script languages. We propose a method for classifying and segmenting multilingual texts in the Hebrew character set using bigram statistics. For texts such as the manuscripts found in the Cairo Genizah, we must also deal with a significant level of noise in the OCR-processed text.
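The abstract does not spell out how the bigram statistics are used, so the following is only a minimal sketch of one way character-bigram profiles could drive both classification and segmentation. The cosine-similarity scoring, the fixed-size word window, the function names, and the two tiny toy training strings are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def bigram_profile(text):
    """Relative frequencies of character bigrams, counted within words."""
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse bigram profiles."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def classify(text, profiles):
    """Return the label whose profile is most similar to the text (assumed scoring)."""
    target = bigram_profile(text)
    return max(profiles, key=lambda lang: cosine(target, profiles[lang]))


def segment(words, profiles, window=2):
    """Label fixed-size word windows and merge neighbours that share a label."""
    segments = []
    for start in range(0, len(words), window):
        chunk = words[start:start + window]
        label = classify(" ".join(chunk), profiles)
        if segments and segments[-1][0] == label:
            segments[-1][1].extend(chunk)
        else:
            segments.append((label, list(chunk)))
    return segments


# Toy training strings (hypothetical stand-ins for real monolingual corpora).
profiles = {
    "hebrew": bigram_profile("בראשית ברא אלהים את השמים ואת הארץ"),
    "aramaic": bigram_profile("בקדמין ברא יי ית שמיא וית ארעא"),
}

mixed = "השמים הארץ שמיא ארעא".split()
result = segment(mixed, profiles, window=2)
```

With real data, the profiles would be estimated from much larger monolingual corpora, and a noise-tolerant variant would likely need smoothing so that OCR errors producing unseen bigrams do not dominate the score.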