Extraction of Chinese compound words: an experimental study on a very large corpus

Authors:
Jian Zhang; Jianfeng Gao; Ming Zhou
Affiliations:
Tsinghua University, China; Microsoft Research China; Microsoft Research China
Venue:
CLPW '00 Proceedings of the second workshop on Chinese language processing: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 12
Year:
2000

Abstract

This paper introduces a statistical method for extracting Chinese compound words from a very large corpus. The method is based on mutual information and context dependency. Experimental results show that it is efficient and robust compared with other approaches. We also examine the impact of different parameter settings, corpus size, and corpus heterogeneity on the extraction results. Finally, we present information retrieval results that demonstrate the usefulness of the extracted compounds.
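
The abstract names two statistics, mutual information and context dependency, but this record does not give the paper's formulas. The sketch below is only an illustration under stated assumptions: it uses standard pointwise mutual information for adjacent token pairs, and it takes "context dependency" to mean the share of a pair's occurrences that have the same left (or right) neighbor. The function name `score_bigrams` and the restriction to two-token candidates are hypothetical choices, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def score_bigrams(tokens):
    """Score each adjacent token pair as a compound candidate.

    Returns {(x, y): (pmi, left_dep, right_dep)}.
    Assumption: these are textbook definitions, not the paper's exact ones.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    # Count which tokens appear immediately left and right of each pair.
    left_ctx = defaultdict(Counter)
    right_ctx = defaultdict(Counter)
    for i in range(n - 1):
        pair = (tokens[i], tokens[i + 1])
        if i > 0:
            left_ctx[pair][tokens[i - 1]] += 1
        if i + 2 < n:
            right_ctx[pair][tokens[i + 2]] += 1

    total_bigrams = n - 1
    scores = {}
    for pair, f_xy in bigrams.items():
        x, y = pair
        # Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) ).
        p_xy = f_xy / total_bigrams
        pmi = math.log2(p_xy / ((unigrams[x] / n) * (unigrams[y] / n)))
        # Context dependency (assumed form): fraction of occurrences that
        # share the single most frequent neighbor on that side.
        left_dep = max(left_ctx[pair].values(), default=0) / f_xy
        right_dep = max(right_ctx[pair].values(), default=0) / f_xy
        scores[pair] = (pmi, left_dep, right_dep)
    return scores
```

Under these assumptions, a pair with high mutual information and low dependency on both sides is a plausible stand-alone compound, whereas a high dependency value suggests the pair is a fragment of a longer unit. Any thresholds would have to be tuned on real data; none are given in this record.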