Chinese new word identification: a latent discriminative model with global features

Authors:
Xiao Sun;De-Gen Huang;Hai-Yu Song;Fu-Ji Ren
Affiliations:
School of Computer Science and Engineering, Dalian Nationalities University, Dalian, China;School of Computer Science and Engineering, Dalian University of Technology, Dalian, China;School of Computer Science and Engineering, Dalian Nationalities University, Dalian, China;Department of Information Science and Intelligent Systems, Tokushima University, Tokushima, Japan
Venue:
Journal of Computer Science and Technology - Special issue on natural language processing
Year:
2011

Abstract

Chinese new words are particularly problematic in Chinese natural language processing. With the rapid development of the Internet and the information explosion, it is impossible to obtain a complete system lexicon for applications in Chinese natural language processing, as new words outside existing dictionaries are constantly being created. The procedures of new word identification and POS tagging are usually separated, so lexical features cannot be fully used. A latent discriminative model, which combines the strengths of the Latent Dynamic Conditional Random Field (LDCRF) and the semi-CRF, is proposed to detect new words together with their POS tags simultaneously, regardless of the types of new words, from Chinese text that has not been pre-segmented. Unlike the semi-CRF, the proposed latent discriminative model applies the LDCRF to generate candidate entities, which accelerates training and decreases the computational cost. The complexity of the proposed hidden semi-CRF can be further adjusted by tuning the number of hidden variables and the number of candidate entities taken from the N-best outputs of the LDCRF model. A new-word-generating framework is proposed for model training and testing, under which the definitions and distributions of new words conform to those in real text. A global feature called "Global Fragment Features" is adopted for new word identification. We tested our model on the corpus from SIGHAN-6. Experimental results show that the proposed method is capable of detecting even low-frequency new words together with their POS tags with satisfactory results. The proposed model performs competitively with state-of-the-art models.
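
The idea of restricting a segment-level model to candidates taken from N-best character-level outputs can be illustrated with a small sketch. The abstract does not specify the tagging scheme or any data structures, so everything here is an assumption made for illustration: the B/M/E/S character tags, the hypothetical helper names `spans_from_tags` and `candidate_spans`, and the hand-written N-best list, which stands in for the output of a trained LDCRF. This is not the authors' implementation.

```python
from typing import List, Set, Tuple

# A candidate word is a character span (start, end), end exclusive.
Span = Tuple[int, int]


def spans_from_tags(tags: List[str]) -> List[Span]:
    """Convert one B/M/E/S tag sequence into word spans.

    Assumed scheme: B = begin, M = middle, E = end, S = single-character word.
    """
    spans: List[Span] = []
    start = 0
    for i, tag in enumerate(tags):
        if tag in ("B", "S"):
            start = i          # a new word starts at this character
        if tag in ("E", "S"):
            spans.append((start, i + 1))  # the current word ends here
    return spans


def candidate_spans(nbest: List[List[str]]) -> Set[Span]:
    """Union of the word spans implied by each of the N-best tag sequences."""
    candidates: Set[Span] = set()
    for tags in nbest:
        candidates.update(spans_from_tags(tags))
    return candidates


# Hypothetical 3-best taggings of a four-character sentence.
nbest = [
    ["B", "E", "B", "E"],
    ["B", "E", "S", "S"],
    ["S", "S", "B", "E"],
]

cands = candidate_spans(nbest)
n_chars = 4
all_spans = n_chars * (n_chars + 1) // 2  # every possible contiguous span

print(sorted(cands))
print(f"{len(cands)} candidate spans instead of {all_spans}")
```

In this toy case the segment-level model would score 6 candidate spans rather than all 10 contiguous spans. Raising or lowering N changes the size of the candidate set, which is one plausible reading of how the number of N-best outputs adjusts model complexity.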