Document classification using domain specific kanji characters extracted by χ² method

  • Authors:
  • Yasuhiko Watanabe;Masaki Murata;Masahito Takeuchi;Makoto Nagao

  • Affiliations:
  • Ryukoku University, Shiga, Japan;Kyoto University, Kyoto, Japan;Kyoto University, Kyoto, Japan;Kyoto University, Kyoto, Japan

  • Venue:
  • COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
  • Year:
  • 1996

Abstract

In this paper we describe a method of classifying Japanese text documents using domain specific kanji characters. Text documents are generally classified by their significant words (keywords). However, these words are difficult to extract from Japanese text, because Japanese is written without blank spaces as delimiters and must first be segmented into words. Therefore, instead of words, we used domain specific kanji characters, which appear more frequently in one domain than in the others. We extracted these characters by the χ² method. Then, using them, we classified the editorial columns "TENSEI JINGO", editorial articles, and articles in "Scientific American (in Japanese)". The correct recognition scores for them were 47%, 74%, and 85%, respectively.
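The abstract does not give the exact statistic, so the following is only a minimal sketch of one common χ² formulation, not the authors' implementation. It assumes a character-by-domain contingency table of kanji counts, scores each (character, domain) cell as (observed - expected)² / expected, and keeps only cells where the character is over-represented. The function names (`is_kanji`, `chi_square_scores`, `classify`), the restriction of kanji to the basic CJK Unified Ideographs block, and the sum-of-scores classification rule are all illustrative assumptions.

```python
from collections import Counter


def is_kanji(ch):
    # Assumption: "kanji" means the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def chi_square_scores(docs_by_domain):
    """Score each (kanji, domain) pair by its chi-square cell contribution.

    docs_by_domain maps a domain name to a list of training texts.
    Only cells where the kanji occurs more often than expected are kept.
    """
    # Observed kanji counts per domain.
    counts = {
        domain: Counter(ch for text in texts for ch in text if is_kanji(ch))
        for domain, texts in docs_by_domain.items()
    }
    domain_totals = {d: sum(c.values()) for d, c in counts.items()}
    char_totals = Counter()
    for c in counts.values():
        char_totals.update(c)
    grand_total = sum(domain_totals.values())

    scores = {}
    for domain, c in counts.items():
        for ch, ch_total in char_totals.items():
            # Expected count if the kanji were independent of the domain.
            expected = ch_total * domain_totals[domain] / grand_total
            observed = c[ch]
            if expected > 0 and observed > expected:
                scores[(ch, domain)] = (observed - expected) ** 2 / expected
    return scores


def classify(text, scores, domains):
    # Assumed decision rule: pick the domain with the largest summed score.
    totals = {d: 0.0 for d in domains}
    for ch in text:
        for d in domains:
            totals[d] += scores.get((ch, d), 0.0)
    return max(totals, key=totals.get)
```

Because the scores are computed over single characters, this sketch needs no word segmentation, which is the motivation stated in the abstract. A real system would train on a much larger corpus per domain than any toy example.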