We present a study of new word identification (NWI) aimed at improving the performance of a Chinese word segmenter. The distribution and types of new words are examined empirically. In particular, we focus on new words of two surface patterns, which together account for more than 80% of the new words in our data sets: NW11 (a two-character new word) and NW21 (a two-character word followed by a single character). NWI is defined as a binary classification problem and is addressed with a statistical learning approach based on an SVM classifier. Several features for NWI are explored, including the in-word probability of a character (IWP), the analogy between new words and lexicon words, an anti-word list, and frequency in documents. The experiments show that these features are useful for NWI. The F-scores we achieved for NWI are 64.4% for NW11 and 54.7% for NW21. In the PK-close test of the first SIGHAN bakeoff, the overall performance of the Chinese word segmenter could be improved by 24.5% in OOV recall (Roov) and by 6.5% in F-score, which brings it to the level of state-of-the-art word segmenters.
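As a rough illustration of the IWP feature named above, the following minimal sketch estimates an in-word probability per character from a segmented corpus and builds a two-value feature vector for an NW11 candidate. The abstract does not define IWP, so the estimate used here (occurrences of a character inside multi-character words divided by all of its occurrences) is an assumption, as are the function names, the toy corpus, and the choice to read an NW11 candidate as two adjacent characters. The SVM training step is not shown.

```python
from collections import Counter


def in_word_probability(segmented_sentences):
    """Estimate IWP(c) for every character in a segmented corpus.

    Assumed definition (not given in the abstract):
    IWP(c) = count of c inside multi-character words / total count of c.
    """
    total = Counter()
    in_word = Counter()
    for sentence in segmented_sentences:
        for word in sentence:
            for ch in word:
                total[ch] += 1
                if len(word) > 1:
                    in_word[ch] += 1
    return {ch: in_word[ch] / total[ch] for ch in total}


def nw11_features(c1, c2, iwp, default=0.0):
    """Hypothetical feature vector for an NW11 candidate (two adjacent characters).

    Characters unseen in the corpus fall back to `default`.
    """
    return (iwp.get(c1, default), iwp.get(c2, default))


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy segmented corpus: each sentence is a list of words (illustrative only).
corpus = [
    ["我", "喜欢", "北京"],
    ["北", "京城", "很", "大"],
    ["我们", "喜欢", "大", "城"],
]
iwp = in_word_probability(corpus)
candidate = nw11_features("北", "城", iwp)
```

In a full system, vectors like `candidate` would be combined with the other features listed in the abstract (analogy to lexicon words, anti-word list, document frequency) and passed to a binary classifier that decides whether the candidate is a new word.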