Language classification and segmentation of noisy documents in Hebrew scripts

  • Authors:
  • Nachum Dershowitz; Alex Zhicharevich

  • Affiliations:
  • Tel Aviv University, Ramat Aviv, Israel; Tel Aviv University, Ramat Aviv, Israel

  • Venue:
  • LaTeCH '12 Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
  • Year:
  • 2012

Abstract

Language classification is a preliminary step for most natural-language processing tasks. The large number of multilingual documents poses a problem for traditional language-classification schemes and requires segmenting each document into monolingual sections. Such language mixing is characteristic of classical and medieval Jewish literature, which frequently combines Hebrew, Aramaic, Judeo-Arabic and other Hebrew-script languages. We propose a method for classification and segmentation of multilingual texts in the Hebrew character set, using bigram statistics. For texts such as the manuscripts found in the Cairo Genizah, we must also deal with a significant level of noise in the OCR-processed text.
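
The abstract names bigram statistics as the basis for both classification and segmentation but gives no further detail. The following is a minimal sketch of one common way such an approach can work, not the authors' published method: character-bigram counts per language, add-one smoothing, and a sliding word window for segmentation. The function names, the smoothing scheme, the window size, and the placeholder training strings (Latin letters standing in for Hebrew-script text) are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the adjacent character pairs of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(samples):
    """Build bigram counts per language from {language: training text}."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}


def log_score(text, counts, vocab_size):
    """Add-one smoothed log-likelihood of text's bigrams under one model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[bg] + 1) / (total + vocab_size))
        for bg in bigrams(text)
    )


def classify(text, models):
    """Pick the language whose bigram model scores the text highest."""
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen bigrams
    return max(
        models,
        key=lambda lang: log_score(text, models[lang], vocab_size),
    )


def segment(words, models, window=3):
    """Label each word by classifying a window of words centered on it."""
    half = window // 2
    labels = []
    for i in range(len(words)):
        chunk = " ".join(words[max(0, i - half):i + half + 1])
        labels.append(classify(chunk, models))
    return labels


# Placeholder training data (hypothetical, not real corpus text).
models = train({"A": "abababab", "B": "cdcdcdcd"})
print(classify("abab", models))
print(segment(["abab", "baba", "cdcd", "dcdc"], models))
```

Classifying a window rather than a single word gives the model more bigrams to work with per decision, which may help when individual words are short or damaged by OCR errors, at the cost of blurring the exact position of a language boundary.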