Accessor variety criteria for Chinese word extraction

Authors:
Haodi Feng; Kang Chen; Xiaotie Deng; Weimin Zheng
Affiliations:
Shandong University, School of Computer Science and Technology, Jinan, PR China / City University of Hong Kong, Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong; Tsinghua University, Department of Computer Science and Technology, Peking, PR China; City University of Hong Kong, Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong; Tsinghua University, Department of Computer Science and Technology, Peking, PR China
Venue:
Computational Linguistics
Year:
2004

Abstract

We address the problem of word extraction from Chinese text collections. We define a word as a meaningful string composed of several Chinese characters. For example, the strings for 'percent' and 'more and more' are not regarded as traditional Chinese words by some people; in our work, however, they are words because they are very widely used and have specific meanings. We start from the view that a word is a distinct linguistic entity that can be used in many different language environments. We treat the characters directly before a string (predecessors) and directly after it (successors) as important factors in determining the independence of the string. We call these characters the accessors of the string. We count the distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents) and use these counts to measure the context independence of the string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction: it is comparable to other iterative methods and, for long words, outperforms them.
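
The counting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `accessor_variety` and `score`, the `max_len` limit, the boundary markers `<S>` and `<E>`, the use of the minimum of the two counts as a single score, and the three toy sentences are all assumptions made for illustration.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Return {string: (distinct predecessors, distinct successors)}
    for every substring of at most max_len characters."""
    preds = defaultdict(set)
    succs = defaultdict(set)
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                # Assumption: sentence boundaries count as special accessors.
                preds[s].add(sent[i - 1] if i > 0 else "<S>")
                succs[s].add(sent[j] if j < n else "<E>")
    return {s: (len(preds[s]), len(succs[s])) for s in preds}


def score(counts):
    """Assumption: combine the two counts by taking the smaller one."""
    left, right = counts
    return min(left, right)


# Toy corpus invented for this sketch (not from TREC 5 or TREC 6).
sentences = ["我们越来越好", "他越来越忙", "天气越来越冷"]
av = accessor_variety(sentences)
print(av["越来越"], score(av["越来越"]))
print(av["来越"], score(av["来越"]))
```

In this toy corpus the string 越来越 is preceded and followed by three different characters, while its fragment 来越 is always preceded by the same character. The fragment therefore receives a lower score, which is the kind of contrast a threshold on the counts could exploit.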