Japanese unknown word identification by character-based chunking

  • Authors: Masayuki Asahara; Yuji Matsumoto

  • Affiliations: Nara Institute of Science and Technology, Japan (both authors)

  • Venue: COLING '04: Proceedings of the 20th International Conference on Computational Linguistics
  • Year: 2004

Abstract

We introduce a character-based chunking method for unknown word identification in Japanese text. A major advantage of the method is its ability to detect low-frequency unknown words with unrestricted character type patterns. The method builds on SVM-based chunking and uses two kinds of features: character n-grams, and the surrounding context of n-best word segmentation candidates produced by statistical morphological analysis. Applied to newspaper and patent texts, it achieves 95% precision and 55-70% recall on newspapers, and more than 85% precision on patent texts.
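
The abstract does not spell out the feature encoding, so the following is only a minimal sketch of how per-character features of this kind might be assembled before being handed to a chunking classifier such as an SVM. The function names (`char_type`, `position_tags`, `char_features`), the B/I/E/S position tags, the Unicode-range character typing, and the window size are all illustrative assumptions, not the authors' implementation. The n-best segmentations are supplied by hand here; in practice they would come from a morphological analyzer.

```python
from typing import Dict, List


def char_type(ch: str) -> str:
    """Coarse character type by Unicode range (an assumed typing scheme)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "HIRAGANA"
    if 0x30A0 <= code <= 0x30FF:
        return "KATAKANA"
    if 0x4E00 <= code <= 0x9FFF:
        return "KANJI"
    if ch.isdigit():
        return "DIGIT"
    if ch.isalpha():
        return "ALPHA"
    return "OTHER"


def position_tags(words: List[str]) -> List[str]:
    """Tag each character by its position in its word: B, I, E, or S (single)."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def char_features(
    text: str, nbest: List[List[str]], window: int = 2
) -> List[Dict[str, str]]:
    """Build one feature dict per character from a context window.

    Each dict holds the characters, their types, and the set of position
    tags that the n-best segmentation candidates assign at each offset.
    """
    for segmentation in nbest:
        if "".join(segmentation) != text:
            raise ValueError("segmentation does not cover the input text")
    tag_seqs = [position_tags(seg) for seg in nbest]

    features: List[Dict[str, str]] = []
    for i in range(len(text)):
        feats: Dict[str, str] = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(text):
                feats[f"char[{offset}]"] = text[j]
                feats[f"type[{offset}]"] = char_type(text[j])
                # Union of tags proposed by all candidates at this position.
                tags_here = sorted({seq[j] for seq in tag_seqs})
                feats[f"tags[{offset}]"] = "|".join(tags_here)
        features.append(feats)
    return features


if __name__ == "__main__":
    # Two hand-written segmentation candidates for the same sentence.
    sentence = "東京都に住む"
    candidates = [["東京", "都", "に", "住む"], ["東", "京都", "に", "住む"]]
    for ch, feats in zip(sentence, char_features(sentence, candidates, window=1)):
        print(ch, feats)
```

Where the candidates disagree about a word boundary, the tag feature carries both alternatives (for example `B|S` on the first character above), which is one plausible way for a classifier to see segmentation ambiguity around a possible unknown word.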