A compression-based algorithm for Chinese word segmentation

Authors:
W. J. Teahan; Rodger McNab; Yingying Wen; Ian H. Witten
Affiliations:
The Robert Gordon University; University of Waikato; University of Waikato; University of Waikato
Venue:
Computational Linguistics
Year:
2000

Abstract

Chinese is written without spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of presegmented text, and when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well compared with specialized schemes for Chinese language segmentation.
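
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the adaptive compression model is swapped for a much simpler order-1 (bigram) character model with additive smoothing, and the function names (`train`, `bits`, `code_length`, `segment`), the smoothing constant `alpha`, and the toy corpus are all assumptions made for illustration. The space is treated as an ordinary symbol, the model is trained on presegmented text, and a space is inserted into new text wherever doing so shortens the total code length. With an order-1 model each gap between characters can be decided independently; a higher-order model would need a search over whole segmentations instead.

```python
import math
from collections import Counter

SPACE = " "

def train(segmented_lines):
    """Count symbol bigrams in presegmented text; the space is an ordinary symbol."""
    bigrams, contexts, alphabet = Counter(), Counter(), {SPACE}
    for line in segmented_lines:
        symbols = SPACE + SPACE.join(line.split()) + SPACE
        alphabet.update(symbols)
        for prev, cur in zip(symbols, symbols[1:]):
            bigrams[prev, cur] += 1
            contexts[prev] += 1
    # One extra alphabet slot is reserved for symbols never seen in training.
    return bigrams, contexts, len(alphabet) + 1

def bits(model, prev, cur, alpha=0.1):
    """Code length in bits of `cur` following `prev` (additive smoothing, an assumption)."""
    bigrams, contexts, size = model
    p = (bigrams[prev, cur] + alpha) / (contexts[prev] + alpha * size)
    return -math.log2(p)

def code_length(model, spaced_text):
    """Total bits needed to encode a text whose word boundaries are marked by spaces."""
    symbols = SPACE + spaced_text + SPACE
    return sum(bits(model, p, c) for p, c in zip(symbols, symbols[1:]))

def segment(model, text):
    """Insert a space at every gap where that lowers the code length."""
    if not text:
        return ""
    out = [text[0]]
    for prev, cur in zip(text, text[1:]):
        with_space = bits(model, prev, SPACE) + bits(model, SPACE, cur)
        if with_space < bits(model, prev, cur):
            out.append(SPACE)
        out.append(cur)
    return "".join(out)

# Toy presegmented corpus (hypothetical data, far too small for real use).
corpus = ["我 喜欢 猫", "我 喜欢 狗", "猫 喜欢 我"]
model = train(corpus)
print(segment(model, "我喜欢狗"))  # prints "我 喜欢 狗"
```

On this toy corpus the segmented string costs fewer bits under the model than the unsegmented one, which is what selects the boundaries. A realistic system would need a far larger training corpus and a stronger model than this bigram stand-in.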