Document classification using domain specific kanji characters extracted by χ² method

Authors:
Yasuhiko Watanabe;Masaki Murata;Masahito Takeuchi;Makoto Nagao
Affiliations:
Ryukoku University, Shiga, Japan;Kyoto University, Kyoto, Japan;Kyoto University, Kyoto, Japan;Kyoto University, Kyoto, Japan
Venue:
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Year:
1996

Abstract

In this paper we describe a method of classifying Japanese text documents using domain specific kanji characters. Text documents are generally classified by their significant words (keywords). However, it is difficult to extract these significant words from Japanese text, because Japanese text is written without blank spaces as delimiters and must first be segmented into words. Therefore, instead of words, we used domain specific kanji characters, which appear more frequently in one domain than in the others. We extracted these domain specific kanji characters by the χ² method. Then, using these domain specific kanji characters, we classified editorial columns "TENSEI JINGO", editorial articles, and articles in "Scientific American (in Japanese)". The correct recognition scores for them were 47%, 74%, and 85%, respectively.
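
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact χ² formulation or the classification rule, so the sketch assumes the standard per-cell contribution (observed - expected)² / expected over a kanji-by-domain count table, keeps only kanji that occur more often than expected in a domain, and assigns a document to the domain with the largest summed score. The function names (`is_kanji`, `chi_square_scores`, `classify`), the Unicode range used to detect kanji, and the toy training texts are all hypothetical.

```python
from collections import Counter


def is_kanji(ch):
    # Assumption: treat the basic CJK Unified Ideographs block as "kanji".
    return "\u4e00" <= ch <= "\u9fff"


def kanji_counts(docs):
    # Count every kanji occurrence over a list of document strings.
    counts = Counter()
    for doc in docs:
        counts.update(ch for ch in doc if is_kanji(ch))
    return counts


def chi_square_scores(domain_docs):
    """Map each domain to {kanji: chi-square contribution}.

    domain_docs: dict of domain name -> list of training texts.
    Only kanji observed more often than expected in a domain are kept,
    mirroring "appear more frequently in one domain than in the others".
    """
    counts = {d: kanji_counts(docs) for d, docs in domain_docs.items()}
    domain_totals = {d: sum(c.values()) for d, c in counts.items()}
    grand_total = sum(domain_totals.values())

    kanji_totals = Counter()
    for c in counts.values():
        kanji_totals.update(c)

    scores = {}
    for d, c in counts.items():
        scores[d] = {}
        for kanji, total in kanji_totals.items():
            observed = c[kanji]
            expected = total * domain_totals[d] / grand_total
            if observed > expected:
                scores[d][kanji] = (observed - expected) ** 2 / expected
    return scores


def classify(text, scores):
    # Assumed decision rule: pick the domain with the largest summed score.
    totals = {d: sum(s.get(ch, 0.0) for ch in text) for d, s in scores.items()}
    return max(totals, key=totals.get)


# Toy training data, invented purely for illustration.
training = {
    "economy": ["株価と経済", "経済成長と株"],
    "science": ["宇宙と科学", "科学実験と宇宙"],
}
scores = chi_square_scores(training)
print(classify("株価の経済", scores))
print(classify("宇宙科学", scores))
```

Because the scores are computed per character, the sketch needs no word segmentation, which is the motivation the abstract gives for using kanji instead of keywords.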