Unknown Chinese word extraction based on variety of overlapping strings

Authors:
Yunming Ye;Qingyao Wu;Yan Li;K. P. Chow;Lucas C. K. Hui;S. M. Yiu
Affiliations:
Yunming Ye: Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China and Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, China
Qingyao Wu: Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China and Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, China
Yan Li: Shenzhen Polytechnic, Shenzhen, China
K. P. Chow: Department of Computer Science, The University of Hong Kong, Hong Kong, China
Lucas C. K. Hui: Department of Computer Science, The University of Hong Kong, Hong Kong, China
S. M. Yiu: Department of Computer Science, The University of Hong Kong, Hong Kong, China
Venue:
Information Processing and Management: an International Journal
Year:
2013



Abstract

Some languages, such as Chinese, do not mark word boundaries with delimiters. To extract words from a sentence in these languages, we usually rely on a dictionary of known words. For unknown words, some approaches rely on a domain-specific dictionary or a tailor-made training data set, but these resources may not be available. An alternative is to use unsupervised methods, which rely on a goodness measure that estimates, from statistics of the given text, how likely a candidate string is to be a meaningful word. The most challenging issue is identifying low-frequency meaningful words. In this paper, we first show through an empirical study on Chinese texts that none of the classical goodness measures can effectively separate low-frequency meaningful words from meaningless ones. To solve this problem, we propose a new goodness measure, the overlap variety method. The key idea is not to consider the absolute number of occurrences of the candidate (i.e., a string of Chinese characters) but to compare the goodness measure (we use the accessor variety) of the candidate with those of the strings overlapping the candidate. The candidate is likely to be meaningful if its accessor variety is larger than the accessor varieties of the overlapping strings. We implement UNExtract, an extraction system for unknown Chinese words, based on this overlap variety method. We evaluate our approach on the CIPS-SIGHAN-2010 bakeoff corpora and show that the proposed measure is more effective than five other state-of-the-art goodness measures (accessor variety, branch entropy, description length gain, frequency substring reduction, pointwise mutual information), especially for low-frequency words and bigram words.
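
The comparison described in the abstract can be illustrated with a small Python sketch. It is a simplified reading of the idea, not the UNExtract implementation. It assumes that accessor variety is the smaller of the numbers of distinct left and right neighbouring characters, that text boundaries count as one special neighbour each, and that the "overlapping strings" are the same-length windows that partially overlap an occurrence of the candidate. The function names and the toy text are hypothetical.

```python
def accessor_variety(text, s):
    """Assumed definition: min(#distinct left neighbours, #distinct right neighbours)."""
    left, right = set(), set()
    start = text.find(s)
    while start != -1:
        end = start + len(s)
        # Simplification: the start and end of the text each act as one special neighbour.
        left.add(text[start - 1] if start > 0 else "<S>")
        right.add(text[end] if end < len(text) else "<E>")
        start = text.find(s, start + 1)
    return min(len(left), len(right))


def overlapping_strings(text, s):
    """Assumed definition: same-length windows that partially overlap an occurrence of s."""
    n = len(s)
    result = set()
    start = text.find(s)
    while start != -1:
        for shift in range(1, n):
            for pos in (start - shift, start + shift):
                if pos >= 0 and pos + n <= len(text):
                    window = text[pos:pos + n]
                    if window != s:
                        result.add(window)
        start = text.find(s, start + 1)
    return result


def is_likely_word(text, s):
    """Accept s if its accessor variety exceeds that of every overlapping string."""
    av = accessor_variety(text, s)
    return all(av > accessor_variety(text, o) for o in overlapping_strings(text, s))


# Hypothetical toy text in which "苹果" (apple) appears in four different contexts.
toy_text = "我喜欢苹果。他买苹果了。苹果很甜。吃苹果吧。"
print(is_likely_word(toy_text, "苹果"))  # candidate that is a real word
print(is_likely_word(toy_text, "欢苹"))  # candidate that straddles a word boundary
```

Because the decision depends on a comparison with neighbouring strings rather than on a fixed frequency threshold, a candidate that occurs only a few times can still be accepted as long as the strings overlapping it vary even less.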