Unsupervised Text Normalization Approach for Morphological Analysis of Blog Documents

Authors:
Kazushi Ikeda;Tadashi Yanagihara;Kazunori Matsumoto;Yasuhiro Takishima
Affiliations:
KDDI R&D Laboratories, Inc., Saitama, Japan 356-8502;KDDI R&D Laboratories, Inc., Saitama, Japan 356-8502;KDDI R&D Laboratories, Inc., Saitama, Japan 356-8502;KDDI R&D Laboratories, Inc., Saitama, Japan 356-8502
Venue:
AI '09 Proceedings of the 22nd Australasian Joint Conference on Advances in Artificial Intelligence
Year:
2009

Abstract

In this paper, we propose an algorithm that reduces the number of unknown words in blog documents by replacing peculiar expressions with formal expressions. Japanese blog documents contain many peculiar expressions that morphological analyzers treat as unknown sequences. Reducing these unknown sequences improves the accuracy of morphological analysis for blog documents. The conventional solution, manually registering peculiar expressions in morphological dictionaries, is costly and requires specialized knowledge. In our algorithm, substitution candidates for peculiar expressions are automatically retrieved from formally written documents such as newspapers and stored as substitution rules. To replace an expression correctly, a substitution rule is selected based on three criteria: its frequency of appearance during the retrieval process, the edit distance between the substituted sequence and the original text, and the estimated improvement in word segmentation accuracy after the substitution. Experimental results show that our algorithm reduces the number of unknown words by 30.3%, twice the reduction rate of conventional methods, while maintaining the same segmentation accuracy.
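
The abstract names three selection criteria but does not say how they are combined. The following is a minimal illustrative sketch in Python, not the authors' implementation. It assumes a simple weighted sum in which a higher retrieval frequency and a higher estimated segmentation gain raise a rule's score and a larger edit distance lowers it. The names `Rule`, `score`, `select_rule`, the weights, and the `seg_gain` field are all hypothetical, and in a real system the segmentation gain would have to come from a morphological analyzer.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


@dataclass
class Rule:
    """Hypothetical substitution rule: peculiar sequence -> formal sequence."""
    peculiar: str
    formal: str
    frequency: int   # how often the rule was seen during retrieval
    seg_gain: float  # assumed, precomputed estimate of segmentation improvement


def score(rule: Rule, text: str,
          w_freq: float = 1.0, w_dist: float = 1.0, w_gain: float = 1.0) -> float:
    """Assumed scoring: weighted sum of the three criteria named in the abstract."""
    substituted = text.replace(rule.peculiar, rule.formal)
    dist = edit_distance(text, substituted)
    return (w_freq * math.log1p(rule.frequency)
            - w_dist * dist
            + w_gain * rule.seg_gain)


def select_rule(rules: List[Rule], text: str) -> Optional[Rule]:
    """Return the best-scoring rule that applies to `text`, or None."""
    applicable = [r for r in rules if r.peculiar in text]
    if not applicable:
        return None
    return max(applicable, key=lambda r: score(r, text))
```

With toy rules such as `Rule("ごーい", "ごい", 40, 0.5)` and `Rule("ーい", "いい", 3, 0.1)`, calling `select_rule(rules, "すごーい")` prefers the first rule under these assumed weights, because both substitutions are one edit away from the original but the first rule is more frequent and has the larger assumed gain. The frequencies and gains in that example are invented for illustration only.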