A language and character set determination method based on N-gram statistics

Authors:
Izumi Suzuki;Yoshiki Mikami;Ario Ohsato;Yoshihide Chubachi
Affiliations:
Nagaoka University of Technology;Nagaoka University of Technology;Nagaoka University of Technology;Numeric & Co. Ltd.
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2002



Abstract

This article introduces an N-gram-based method for detecting the language, script, and encoding scheme of a text document. Given a target document as a sequence of bytes, the method counts how many of the target's byte sequences match the byte sequences that can appear in texts belonging to a given language, script, and encoding scheme. The detection mechanism differs from conventional N-gram-based methods in that the threshold for any category is uniquely predetermined. The method was originally created for a survey of how many web pages are written in a particular language, script, and encoding scheme. That use requires the method to return either the correct answer or "unable to detect," where "unable to detect" includes documents in a language, script, or encoding scheme other than those registered. Some minor problems remain, but experiments have confirmed the method's effectiveness for detecting language, script, and encoding scheme.
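The matching idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the value of N, the threshold, or how training texts are turned into a set of admissible byte sequences, so the bigram size `n=2`, the threshold `0.9`, the helper names (`byte_ngrams`, `build_profile`, `match_rate`, `detect`), and the rule of returning `None` when zero or several categories pass are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Set


def byte_ngrams(data: bytes, n: int = 2) -> List[bytes]:
    """Return every contiguous n-byte sequence in data."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def build_profile(samples: Iterable[bytes], n: int = 2) -> Set[bytes]:
    """Collect the byte n-grams seen in sample texts of one category
    (one language, script, and encoding scheme)."""
    profile: Set[bytes] = set()
    for sample in samples:
        profile.update(byte_ngrams(sample, n))
    return profile


def match_rate(target: bytes, profile: Set[bytes], n: int = 2) -> float:
    """Fraction of the target's byte n-grams that occur in the profile."""
    grams = byte_ngrams(target, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)


def detect(target: bytes,
           profiles: Dict[str, Set[bytes]],
           threshold: float = 0.9,  # assumed value, same for every category
           n: int = 2) -> Optional[str]:
    """Return the single category whose match rate reaches the threshold.

    None stands for "unable to detect". Treating an ambiguous result
    (several categories passing) as None is an assumption of this sketch.
    """
    passing = [label for label, profile in profiles.items()
               if match_rate(target, profile, n) >= threshold]
    return passing[0] if len(passing) == 1 else None


# Hypothetical toy profiles; a real survey would need far larger samples.
profiles = {
    "en/ascii": build_profile([b"the quick brown fox jumps over the lazy dog"]),
    "ja/utf-8": build_profile(["これは日本語の文章です".encode("utf-8")]),
}

print(detect(b"the lazy fox", profiles))
print(detect("日本語です".encode("utf-8"), profiles))
print(detect(b"\x00\x01\x02\x03", profiles))
```

Because the same fixed threshold applies to every category, a document that matches no registered profile well enough falls through to `None` rather than being forced into the nearest category, which mirrors the "unable to detect" requirement stated in the abstract.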