Compound noun segmentation based on lexical data extracted from corpus

Authors:
Juntae Yoon
Affiliations:
IRCS, University of Pennsylvania, Philadelphia, PA
Venue:
ANLC '00 Proceedings of the sixth conference on Applied natural language processing
Year:
2000

Abstract

Compound noun analysis is one of the crucial problems in Korean language processing because a sequence of nouns in Korean may be written without intervening white space in real texts, which makes it difficult to identify the morphological constituents. This paper presents an effective method of Korean compound noun segmentation based on lexical data extracted from a corpus. The segmentation is done in two steps. First, segmentation relies on a manually constructed built-in dictionary whose entries were extracted from a 30-million-word corpus. Second, a statistical segmentation algorithm is applied, using simple nouns and their frequencies that are likewise extracted from the corpus. The analysis is carried out with CYK tabular parsing and a min-max operation. In experiments, the method achieves an accuracy of about 97.29%, which shows that it is very effective.
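
The two-step procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the scoring, so several things here are assumptions: the min-max operation is read as "score a segmentation by the frequency of its rarest constituent, then pick the segmentation with the highest such score"; the names `segment_statistical` and `segment_compound`, the dictionary layouts, and the toy Latin-letter data are all hypothetical and are not taken from the paper.

```python
def segment_statistical(word, freq):
    """CYK-style table fill over all spans of `word`.

    best[i][j] holds (score, pieces) for word[i:j], or None if the span
    cannot be covered by known nouns. ASSUMPTION: the score of a split is
    the minimum of its parts' scores, and each cell keeps the maximum.
    """
    n = len(word)
    best = [[None] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            span = word[i:j]
            # Candidate 1: the whole span is itself a known simple noun.
            cand = (freq[span], [span]) if span in freq else None
            # Candidate 2: combine two already-analysed sub-spans.
            for k in range(i + 1, j):
                left, right = best[i][k], best[k][j]
                if left is None or right is None:
                    continue
                score = min(left[0], right[0])
                if cand is None or score > cand[0]:
                    cand = (score, left[1] + right[1])
            best[i][j] = cand
    return best[0][n]


def segment_compound(word, builtin, freq):
    """Step 1: hand-built dictionary lookup. Step 2: statistical fallback."""
    if word in builtin:
        return builtin[word]
    result = segment_statistical(word, freq)
    return result[1] if result is not None else None


# Hypothetical toy data: noun frequencies and one built-in entry.
FREQ = {"ab": 50, "cd": 40, "a": 3, "bcd": 10, "abc": 2, "d": 7}
BUILTIN = {"abd": ["ab", "d"]}

print(segment_statistical("abcd", FREQ))
print(segment_compound("abd", BUILTIN, FREQ))
```

With this toy data, the split `ab` + `cd` wins for `abcd` because its rarest part (frequency 40) is more frequent than the rarest part of `a` + `bcd` (3) or `abc` + `d` (2). Ties are resolved in favour of the first candidate found, which is a design choice of this sketch rather than something the abstract specifies.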