Compound noun segmentation based on lexical data extracted from corpus

Authors:
Juntae Yoon
Affiliations:
IRCS, University of Pennsylvania, 3401 Walnut St., Suite 400A, Philadelphia, PA 19104-6228, USA/ e-mail: jtyoon@linc.cis.upenn.edu
Venue:
Natural Language Engineering
Year:
2001

Abstract

Compound noun segmentation is a crucial problem in Korean language processing because a series of nouns in Korean may be written without spaces in real text, which makes it difficult to identify their morphological constituents. This paper presents a method of Korean compound noun segmentation based on lexical data extracted from a corpus. Segmentation proceeds in two steps. First, a Hand-Built Segmentation Dictionary (HBSD) is used to segment compound nouns that occur frequently or require exceptional handling. Second, a segmentation algorithm that uses corpus data is applied, with simple nouns and their frequencies stored in a Simple Noun Dictionary (SND). The analysis is based on modified tabular parsing using a min-max operation. In our experiments, the method achieved an accuracy of about 97.29%.
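The two-step procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the min-max operation, so the scoring rule here is an assumption: each candidate segmentation is scored by the lowest corpus frequency among its parts (min), and the candidate with the highest such score is kept (max), with a CYK-style table over substrings standing in for the modified tabular parsing. The names `segment`, `hbsd`, and `snd`, the tie-breaking rule, and all frequencies are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

def segment(compound: str,
            hbsd: Dict[str, List[str]],
            snd: Dict[str, int]) -> Optional[List[str]]:
    """Segment a compound noun; return None if no analysis is found."""
    # Step 1: the hand-built dictionary overrides everything else.
    if compound in hbsd:
        return list(hbsd[compound])

    # Step 2: tabular (CYK-style) search over all substrings.
    # best[(i, j)] = (score, parts) for compound[i:j], where score is the
    # minimum SND frequency among the parts (assumed min-max rule).
    n = len(compound)
    best: Dict[Tuple[int, int], Tuple[int, List[str]]] = {}
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            span = compound[i:j]
            cand: Optional[Tuple[int, List[str]]] = None
            if span in snd:
                # The span is itself a known simple noun.
                cand = (snd[span], [span])
            for k in range(i + 1, j):
                left, right = best.get((i, k)), best.get((k, j))
                if left is None or right is None:
                    continue
                score = min(left[0], right[0])   # weakest part decides
                parts = left[1] + right[1]
                # Keep the higher score; on ties prefer fewer parts
                # (an assumed tie-break, not from the paper).
                if (cand is None or score > cand[0]
                        or (score == cand[0] and len(parts) < len(cand[1]))):
                    cand = (score, parts)
            if cand is not None:
                best[(i, j)] = cand

    result = best.get((0, n))
    return result[1] if result is not None else None


# Toy dictionaries with made-up frequencies.
toy_snd = {"정보": 120, "검색": 80, "시스템": 95, "정": 3, "보검": 1}
toy_hbsd = {"대학생선교회": ["대학생", "선교회"]}

print(segment("정보검색시스템", toy_hbsd, toy_snd))
print(segment("대학생선교회", toy_hbsd, toy_snd))
```

Under this assumed rule, a segmentation that relies on a rare noun is penalized as a whole, which is one plausible reading of why noun frequencies are stored in the SND.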