We are interested in the problem of word extraction from Chinese text collections. We define a word as a meaningful string composed of several Chinese characters. For example, the strings glossed as 'percent' and 'more and more' are not recognized as traditional Chinese words by some people. In our work, however, they are words because they are very widely used and have specific meanings. We start from the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We treat the characters directly before a string (predecessors) and the characters directly after it (successors) as important factors for determining the independence of the string, and we call such characters the accessors of the string. We count the distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents) and use these counts to measure the context independence of the string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction; it is comparable to other iterative methods and, for long words, outperforms them.
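The counting of distinct predecessors and successors described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `accessor_counts` and `accessor_variety`, the `max_len` cutoff, and the three toy sentences are all invented here, and combining the two counts by taking their minimum is an assumption, since the abstract does not say how the counts are combined.

```python
from collections import defaultdict


def accessor_counts(sentences, max_len=4):
    """Collect the distinct predecessors and successors of every substring.

    Only substrings of length 1..max_len are considered (max_len is an
    illustrative cutoff, not a value taken from the abstract).
    """
    preds = defaultdict(set)  # substring -> characters seen directly before it
    succs = defaultdict(set)  # substring -> characters seen directly after it
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                if i > 0:
                    preds[s].add(sent[i - 1])
                if j < n:
                    succs[s].add(sent[j])
    return preds, succs


def accessor_variety(s, preds, succs):
    """Combine the two counts with min() (an assumption of this sketch)."""
    return min(len(preds.get(s, ())), len(succs.get(s, ())))


# Toy sentences invented for illustration; a real run would use a large corpus.
sentences = ["他越来越高", "我越来越忙", "天气越来越冷"]
preds, succs = accessor_counts(sentences)

print(accessor_variety("越来越", preds, succs))  # 3: three predecessors, three successors
print(accessor_variety("来越", preds, succs))    # 1: always preceded by the same character
```

In this toy data the full string appears in three different contexts on both sides, while its fragment is always preceded by the same character, so the fragment scores lower. That is the intuition behind using accessor counts as a measure of context independence.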