A part of speech estimation method for Japanese unknown words using a statistical model of morphology and context

Authors:
Masaaki Nagata
Affiliations:
NTT Cyber Space Laboratories, Kanagawa, Japan
Venue:
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Year:
1999

Abstract

We present a statistical model of Japanese unknown words consisting of a set of length and spelling models classified by the character types that constitute a word. The point is quite simple: different character sets should be treated differently and the changes between character types are very important because Japanese script has both ideograms like Chinese (kanji) and phonograms like English (katakana). Both word segmentation accuracy and part of speech tagging accuracy are improved by the proposed model. The model can achieve 96.6% tagging accuracy if unknown words are correctly segmented.
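To make the role of character types concrete, the following is a minimal sketch of splitting a Japanese string into runs of the same character type and counting the type changes. It is an illustration only, not the model described in the abstract: the function names `char_type`, `type_runs`, and `count_type_changes` are hypothetical, and the Unicode ranges are a simplification (for example, full-width Latin letters and rare kanji outside the main CJK block fall into "other").

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """Return a coarse character type for one character.

    The ranges below are simplified assumptions based on Unicode blocks.
    """
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block (includes the long-vowel mark)
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (main block only)
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    return "other"


def type_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters of the same type into (type, run) pairs."""
    return [(t, "".join(group)) for t, group in groupby(text, key=char_type)]


def count_type_changes(text: str) -> int:
    """Number of positions where the character type changes."""
    return max(len(type_runs(text)) - 1, 0)


if __name__ == "__main__":
    for word in ["サッカー部", "食べる", "NTT"]:
        print(word, type_runs(word), count_type_changes(word))
```

A word written in a single script yields one run and zero changes, while a mixed word such as a katakana stem followed by a kanji suffix yields a boundary that a type-aware model could treat as informative.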