Two-character Chinese word extraction based on hybrid of internal and contextual measures

Authors:
Shengfen Luo;Maosong Sun
Affiliations:
Tsinghua University, Beijing, China;Tsinghua University, Beijing, China
Venue:
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Year:
2003

Abstract

Word extraction is an important task in text information processing. Statistical measures for word extraction fall mainly into two kinds: internal measures and contextual measures. This paper discusses both kinds for Chinese word extraction. First, nine widely adopted internal measures are tested and compared individually. Then various schemes for combining these measures are tried in order to improve performance. Finally, left/right entropy is integrated to assess the effect of contextual measures. A genetic algorithm is explored to automatically adjust the combination weights and thresholds. Experiments on two-character Chinese word extraction show promising results: mutual information, the most powerful internal measure, reaches an F-measure of 57.82%, whereas the best combination scheme of internal measures achieves 59.87%. With the contextual measure integrated, word extraction achieves an F-measure of 68.48%.
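
As an illustration of the two kinds of measures named in the abstract, the following is a minimal sketch that scores every adjacent character pair in a text with one internal measure (mutual information) and one contextual measure (left/right entropy). The abstract does not give the paper's formulas, so the code assumes the common textbook definitions: pointwise mutual information over character and pair frequencies, and the entropy of the characters seen immediately to the left and right of the pair. The names `neighbor_entropy`, `bigram_scores` and `is_candidate`, and the threshold values, are hypothetical; in the paper the combination weights and thresholds are tuned by a genetic algorithm, which is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def neighbor_entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighboring characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def bigram_scores(text):
    """Map each adjacent character pair to (mutual_info, left_entropy, right_entropy).

    Assumed definitions (not taken from the paper):
      mutual_info   = log2( p(xy) / (p(x) * p(y)) )
      left_entropy  = entropy of characters directly preceding the pair
      right_entropy = entropy of characters directly following the pair
    """
    n = len(text)
    unigrams = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(n - 1))
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for i in range(n - 1):
        pair = text[i:i + 2]
        if i > 0:
            left[pair][text[i - 1]] += 1
        if i + 2 < n:
            right[pair][text[i + 2]] += 1

    scores = {}
    for pair, count in bigrams.items():
        p_xy = count / (n - 1)
        p_x = unigrams[pair[0]] / n
        p_y = unigrams[pair[1]] / n
        mutual_info = math.log2(p_xy / (p_x * p_y))
        scores[pair] = (
            mutual_info,
            neighbor_entropy(left[pair]),
            neighbor_entropy(right[pair]),
        )
    return scores


def is_candidate(score, mi_min=1.0, entropy_min=0.5):
    """Hybrid filter: strong internal association and varied context on both sides.

    The thresholds are placeholder values, not the tuned ones from the paper.
    """
    mutual_info, left_h, right_h = score
    return mutual_info >= mi_min and min(left_h, right_h) >= entropy_min
```

Calling `bigram_scores(corpus_text)` on a raw, unsegmented string and keeping the pairs for which `is_candidate` returns true gives a rough list of two-character word candidates. Requiring both entropies to clear the threshold reflects the intuition that a real word appears in varied contexts on either side, whereas a fragment of a longer word tends to have a fixed neighbor.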