A bottom-up merging algorithm for Chinese unknown word extraction

Authors:
Wei-Yun Ma;Keh-Jiann Chen
Affiliations:
Institute of Information Science, Academia Sinica;Institute of Information Science, Academia Sinica
Venue:
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Year:
2003

Abstract

Statistical methods for extracting Chinese unknown words often suffer from the problem that superfluous character strings with strong statistical associations are extracted as well. To solve this problem, this paper proposes a set of general morphological rules to broaden coverage; the rules are augmented with linguistic and statistical constraints to increase the precision of the representation. To disambiguate rule applications and reduce the complexity of rule matching, a bottom-up merging algorithm for extraction is proposed. It merges possible morphemes recursively by consulting the general rules and dynamically decides which rule should be applied first according to the priorities of the rules. The effects of different priority strategies are compared in our experiment, and the experimental results show that the performance of the proposed method is very promising.
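
The merging procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the category tags (`M` for a candidate morpheme, `UW` for an unknown word, `W` for a known word), the two rules, the `ASSOC` score table, and the 0.3 threshold are all hypothetical and chosen only for illustration. The sketch assumes every rule pattern covers at least two tokens, so each merge shortens the sequence and the loop terminates.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Token = Tuple[str, str]  # (surface string, category tag)


@dataclass(frozen=True)
class Rule:
    pattern: Tuple[str, ...]                   # tags of adjacent tokens to merge (length >= 2)
    result: str                                # tag given to the merged token
    priority: Callable[[List[Token]], float]   # computed per match; higher is applied first
    constraint: Callable[[List[Token]], bool]  # extra check that can veto a match


def bottom_up_merge(tokens: Sequence[Token], rules: Sequence[Rule]) -> List[Token]:
    """Repeatedly apply the highest-priority matching rule until none matches."""
    tokens = list(tokens)
    while True:
        best = None  # (sort key, rule, start index)
        for rule in rules:
            n = len(rule.pattern)
            for start in range(len(tokens) - n + 1):
                span = tokens[start:start + n]
                if tuple(tag for _, tag in span) != rule.pattern:
                    continue
                if not rule.constraint(span):
                    continue
                # Ties on priority are broken in favour of the leftmost match.
                key = (rule.priority(span), -start)
                if best is None or key > best[0]:
                    best = (key, rule, start)
        if best is None:
            return tokens
        _, rule, start = best
        n = len(rule.pattern)
        merged = ("".join(text for text, _ in tokens[start:start + n]), rule.result)
        tokens[start:start + n] = [merged]


# Hypothetical association scores standing in for a real statistical measure.
ASSOC = {("小", "明"): 0.9, ("王", "小"): 0.4, ("王", "小明"): 0.8}


def assoc(span: List[Token]) -> float:
    return ASSOC.get(tuple(text for text, _ in span), 0.0)


def strong_enough(span: List[Token]) -> bool:
    return assoc(span) >= 0.3  # hypothetical threshold


rules = [
    Rule(("M", "M"), "UW", priority=assoc, constraint=strong_enough),
    Rule(("M", "UW"), "UW", priority=assoc, constraint=strong_enough),
]

tokens = [("王", "M"), ("小", "M"), ("明", "M"), ("說", "W")]
result = bottom_up_merge(tokens, rules)
print(result)  # [('王小明', 'UW'), ('說', 'W')]
```

In this toy run the pair 小明 is merged first because its score is higher than that of 王小, and the resulting unit is then merged with 王. Changing the `priority` function is one way to experiment with different priority strategies of the kind the abstract mentions.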