A probabilistic approach to compound noun indexing in Korean texts

Authors:
Hyouk R. Park;Young S. Han;Kang H. Lee;Key-Sun Choi
Affiliations:
Korea R&D Information Center/KIST, YuSong Taejon, Korea;Korea R&D Information Center/KIST, YuSong Taejon, Korea;Korea R&D Information Center/KIST, YuSong Taejon, Korea;Computer Science Department KAIST, YuSong Taejon, Korea
Venue:
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
Year:
1996

Abstract

In this paper we address the problem of compound noun indexing, that is, segmenting or decomposing compound nouns into promising index terms. Compound nouns used as index terms usually denote specific notions and therefore tend to increase retrieval precision. Using the component nouns of a compound noun as index terms, on the other hand, may improve recall but can decrease precision.

Our proposed method handles compound nouns with the goal of increasing recall while preserving precision. It computes the relevance of the component nouns of a compound noun to the document content by comparing the document sets supported by the component nouns and by the terms of the document. The operational content of a term is represented as the probability distribution of the term over the document set.

Experiments with a set of 1,000 documents show that our method yields a 33% increase in retrieval performance compared to the indexing method without compound noun analysis, and is as good as manual decomposition by human experts.
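
The idea of representing a term by its distribution over the document set, and of scoring a component noun against the document content, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the comparison measure, so cosine similarity is used here purely as an assumption, and the function names (term_distribution, document_profile, component_relevance) and the toy corpus are hypothetical.

```python
from collections import Counter
from math import sqrt


def term_distribution(term, docs):
    """Relative frequency of `term` across the document set (sums to 1 if the term occurs)."""
    counts = [Counter(doc)[term] for doc in docs]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def cosine(p, q):
    """Cosine similarity between two equal-length vectors (assumed measure, not from the paper)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def document_profile(doc_index, docs):
    """Average the distributions of the distinct terms appearing in one document."""
    dists = [term_distribution(t, docs) for t in set(docs[doc_index])]
    return [sum(d[i] for d in dists) / len(dists) for i in range(len(docs))]


def component_relevance(component, doc_index, docs):
    """Score how well a component noun's distribution matches the document's profile."""
    return cosine(term_distribution(component, docs), document_profile(doc_index, docs))


# Hypothetical toy corpus: each document is a list of already-extracted nouns.
docs = [
    ["information", "retrieval", "system"],
    ["information", "retrieval", "index"],
    ["noun", "phrase", "grammar"],
]

relevant = component_relevance("retrieval", 0, docs)
unrelated = component_relevance("grammar", 0, docs)
print(relevant > unrelated)
```

Under these assumptions, a component noun whose distribution overlaps with that of the document's other terms scores higher and would be kept as an index term, while a component that occurs mainly in unrelated documents scores low and would be dropped. The threshold for keeping a component is left open here, since the abstract does not state one.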