A language and character set determination method based on N-gram statistics

  • Authors:
  • Izumi Suzuki;Yoshiki Mikami;Ario Ohsato;Yoshihide Chubachi

  • Affiliations:
  • Nagaoka University of Technology;Nagaoka University of Technology;Nagaoka University of Technology;Numeric & Co. Ltd.

  • Venue:
  • ACM Transactions on Asian Language Information Processing (TALIP)
  • Year:
  • 2002

Abstract

This article introduces an N-gram-based method for detecting language, script, and encoding scheme. Given a computer-encoded target text document, the method checks how many byte sequences of the target match the byte sequences that can appear in texts belonging to a particular language, script, and encoding scheme. The detection mechanism differs from conventional N-gram-based methods in that its threshold for any category is uniquely predetermined. The method was originally created for a survey of web pages, conducted to find how many web pages are written in a particular language, script, and encoding scheme. The survey requires that the method respond with either the correct answer or "unable to detect," where "unable to detect" includes the case "other than registered." The method has some minor problems, but experiments have confirmed its effectiveness for detecting language, script, and encoding scheme.
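The matching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the N-gram length (`n=2`), the threshold value (`0.9`), the category labels, the toy training samples, and the rule of returning the single best-matching category are all assumptions made for illustration. The sketch only shows the general shape: count how many byte N-grams of the target appear in a category's set of known N-grams, and answer "unable to detect" when no category reaches the fixed threshold.

```python
from typing import Dict, Iterable, List, Set

UNABLE = "unable to detect"  # also covers "other than registered"


def byte_ngrams(data: bytes, n: int = 2) -> List[bytes]:
    """Return every contiguous byte sequence of length n in data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_profile(samples: Iterable[bytes], n: int = 2) -> Set[bytes]:
    """Collect the byte N-grams seen in training texts of one category."""
    profile: Set[bytes] = set()
    for sample in samples:
        profile.update(byte_ngrams(sample, n))
    return profile


def match_rate(target: bytes, profile: Set[bytes], n: int = 2) -> float:
    """Fraction of the target's byte N-grams that occur in the profile."""
    grams = byte_ngrams(target, n)
    if not grams:
        return 0.0
    return sum(g in profile for g in grams) / len(grams)


def detect(target: bytes, profiles: Dict[str, Set[bytes]],
           threshold: float = 0.9, n: int = 2) -> str:
    """Return the best-matching category, or UNABLE below the threshold.

    The threshold is fixed in advance and shared by all categories
    (0.9 is an assumed value, not one taken from the article).
    """
    best_label, best_rate = UNABLE, 0.0
    for label, profile in profiles.items():
        rate = match_rate(target, profile, n)
        if rate > best_rate:
            best_label, best_rate = label, rate
    return best_label if best_rate >= threshold else UNABLE


# Toy profiles; real ones would be built from large corpora per category.
profiles = {
    "english/latin/ascii": build_profile(
        [b"the quick brown fox jumps over the lazy dog"]),
    "japanese/kana/utf-8": build_profile(
        ["これはにほんごのぶんしょうです".encode("utf-8")]),
}

print(detect(b"the lazy dog", profiles))
print(detect("にほんご".encode("utf-8"), profiles))
print(detect(b"\x00\x01\x02\x03", profiles))
```

Because the threshold is fixed rather than relative to competing categories, a document in an unregistered language or encoding falls below it for every profile and is reported as "unable to detect" instead of being forced into the nearest category.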