Japanese unknown word identification by character-based chunking

Authors:
Masayuki Asahara; Yuji Matsumoto
Affiliations:
Nara Institute of Science and Technology, Japan; Nara Institute of Science and Technology, Japan
Venue:
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Year:
2004

Abstract

We introduce character-based chunking for unknown word identification in Japanese text. A major advantage of the method is its ability to detect low-frequency unknown words with unrestricted character-type patterns. The method builds on SVM-based chunking, using as features character n-grams and the surrounding context of the n-best word segmentation candidates produced by statistical morphological analysis. Applied to newspaper and patent texts, it achieves 95% precision and 55-70% recall on newspapers and more than 85% precision on patent texts.
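
As a rough illustration of the character-based chunking idea described in the abstract, the sketch below labels each character of a sentence with a chunk tag and builds a small feature window around it. It is not the authors' implementation: the B/I/O tag set, the window size, the coarse character-type classes, and the function names `char_type`, `bio_tags`, and `char_features` are all assumptions made for this example. The n-best segmentation features mentioned in the abstract are omitted, since they would require a morphological analyzer.

```python
import unicodedata


def char_type(ch):
    """Coarse character class derived from the Unicode name (illustrative only)."""
    name = unicodedata.name(ch, "")
    if "KATAKANA" in name:
        return "katakana"
    if "HIRAGANA" in name:
        return "hiragana"
    if "CJK UNIFIED" in name:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"


def bio_tags(segmented):
    """Turn (word, is_unknown) pairs into one tag per character.

    B marks the first character of an unknown word, I the following
    characters of that word, and O every character of a known word.
    This tag set is an assumption, not taken from the paper.
    """
    tags = []
    for word, is_unknown in segmented:
        for i in range(len(word)):
            if not is_unknown:
                tags.append("O")
            else:
                tags.append("B" if i == 0 else "I")
    return tags


def char_features(text, i, window=2):
    """Characters and character types in a window around position i."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(text):
            feats[f"char[{off}]"] = text[j]
            feats[f"type[{off}]"] = char_type(text[j])
        else:
            # Positions outside the sentence get a padding symbol.
            feats[f"char[{off}]"] = "<pad>"
            feats[f"type[{off}]"] = "pad"
    # Character bigram starting at position i (shorter at the sentence end).
    feats["bigram[0]"] = text[i:i + 2]
    return feats


# Hypothetical training sentence: the middle word is treated as unknown.
segmented = [("東京", False), ("スカイツリー", True), ("に", False)]
text = "".join(word for word, _ in segmented)
tags = bio_tags(segmented)
features = [char_features(text, i) for i in range(len(text))]
```

Each `(features[i], tags[i])` pair could then serve as one training instance for a per-character classifier such as an SVM. At prediction time, maximal runs of a B tag followed by I tags would be read off as candidate unknown words.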