Identifying the coding system and language of on-line documents on the Internet

Authors:
Gen-itiro Kikui
Affiliations:
NTT Information and Communication Systems Laboratories, Kanagawa, Japan
Venue:
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Year:
1996

Abstract

This paper proposes a new algorithm that simultaneously identifies the coding system and the language of a code string fetched from the Internet, especially from the World-Wide Web. The algorithm uses statistical language models both to select the correctly decoded string and to determine its language. The proposed algorithm covers 9 languages and 11 coding systems used in Eastern Asia and Western Europe. Experimental results show that the accuracy of the algorithm exceeds 95% on a test set of 640 on-line documents.
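
The general idea described in the abstract, decoding a byte string under each candidate coding system and letting statistical language models pick the most plausible result, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the character-bigram model, the smoothing constants ALPHA and VOCAB, the toy training text in SAMPLES, the three encodings in ENCODINGS, and the function names train, avg_logprob and identify are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training text; a real system would train on large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the language of the web",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist die sprache",
    "ja": "これは日本語の文書です。インターネット上の文書の言語を識別します。",
}
# Assumed candidate coding systems (Python codec names), far fewer than the paper's 11.
ENCODINGS = ["iso-8859-1", "shift_jis", "euc_jp"]

ALPHA = 0.001   # assumed additive-smoothing constant
VOCAB = 65536   # assumed alphabet size, so unseen bigrams are penalized heavily


def train(text):
    """Count character bigrams and the characters that start them."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    return pairs, firsts


def avg_logprob(text, model):
    """Average smoothed bigram log-probability per character transition."""
    pairs, firsts = model
    if len(text) < 2:
        return float("-inf")
    total = 0.0
    for a, b in zip(text, text[1:]):
        total += math.log((pairs[(a, b)] + ALPHA) / (firsts[a] + ALPHA * VOCAB))
    # Normalize by length: different decodings yield different string lengths.
    return total / (len(text) - 1)


def identify(data, models, encodings=ENCODINGS):
    """Return (score, encoding, language, decoded_text) for the best pair."""
    best = None
    for enc in encodings:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this coding system
        for lang, model in models.items():
            score = avg_logprob(text, model)
            if best is None or score > best[0]:
                best = (score, enc, lang, text)
    return best


MODELS = {lang: train(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    data = "日本語の文書です".encode("shift_jis")
    score, enc, lang, text = identify(data, MODELS)
    print(enc, lang, text)
```

Scoring every (coding system, language) pair jointly means a decoding that succeeds but produces garbage is rejected because no language model finds it plausible. Note that pure ASCII input decodes identically under all three assumed encodings, so in that case only the language decision is meaningful and the reported encoding is simply the first one tried.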