The automatic extraction of open compounds from text corpora

Authors:
Virach Sornlertlamvanich;Hozumi Tanaka
Affiliations:
Tokyo Institute of Technology, Tokyo, Japan;Tokyo Institute of Technology, Tokyo, Japan
Venue:
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 2
Year:
1996

Citing 3
Cited 5

Word association norms, mutual information, and lexicography

Computational Linguistics
Automatically extracting and representing collocations for language generation

ACL '90 Proceedings of the 28th annual meeting on Association for Computational Linguistics
A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese

COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 1

Automatic corpus-based Thai word extraction with the c4.5 learning algorithm

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Orthographic Errors in Web Pages: Toward Cleaner Web Corpora

Computational Linguistics
Statistical-Based Approach to Non-segmented Language Processing

IEICE - Transactions on Information and Systems
Boosting-based ensemble learning with penalty setting profiles for automatic Thai unknown word recognition

ICCCI'10 Proceedings of the Second international conference on Computational collective intelligence: technologies and applications - Volume Part II
Boosting-based ensemble learning with penalty profiles for automatic Thai unknown word recognition

Computers & Mathematics with Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper describes a new method for extracting open compounds (uninterrupted sequences of words) from text corpora of languages, such as Thai, Japanese and Korea that exhibit unexplicit word segmentation. Without applying word segmentation techniques to the inputted plain text, we generate n-gram data from it. We then count the occurence of each string and sort them in alphabetical order. It is significant that the frequency of occurrence of strings decreases when the window size of observation is extended. From the statistical point of view, a word is a string with a fixed pattern that is used repeatedly, meaning that it should occur with a higher frequency than a string that is not a word. We observe the variation of frequency of the sorted n-gram data and extract the strings that experience a significant change in frequency of occurrence when their length is extended. We apply this occurrence test to both the right and left hand sides of all strings to ensure the accurate detection of both boundaries of the string. The method returned satisfying results regardless of the size of the input file.