Chinese is written without spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of pre-segmented text and, when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well compared with specialized schemes for Chinese language segmentation.
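The idea of choosing boundaries that maximize compression can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: a fixed-order character model with add-alpha smoothing stands in for the adaptive compression model, the names `CharModel` and `segment` are hypothetical, and the toy corpus is English so the result is easy to read. Training text contains explicit spaces; segmentation then searches, with dynamic programming over the last few output symbols, for the placement of spaces that gives the shortest code length in bits.

```python
import math

class CharModel:
    """Order-k character model with add-alpha smoothing.

    Assumption: a simple stand-in for an adaptive compression model,
    not the model used by the authors. Requires order >= 1.
    """

    def __init__(self, order=2, alpha=0.01):
        self.order = order
        self.alpha = alpha
        self.counts = {}   # (context, symbol) -> count
        self.totals = {}   # context -> count
        self.vocab = set()

    def train(self, text):
        # Count each symbol in the context of the preceding `order` symbols.
        for i, ch in enumerate(text):
            ctx = text[max(0, i - self.order):i]
            self.counts[(ctx, ch)] = self.counts.get((ctx, ch), 0) + 1
            self.totals[ctx] = self.totals.get(ctx, 0) + 1
            self.vocab.add(ch)

    def cost(self, ctx, ch):
        # Code length in bits of `ch` after `ctx`: -log2 of its probability.
        ctx = ctx[-self.order:]
        v = len(self.vocab) + 1  # one extra slot for unseen symbols
        seen = self.counts.get((ctx, ch), 0)
        total = self.totals.get(ctx, 0)
        return -math.log2((seen + self.alpha) / (total + self.alpha * v))


def segment(model, text, space=" "):
    """Insert `space` into `text` so that the total code length is minimal."""
    # State: last `order` output symbols -> (bits so far, output so far).
    best = {"": (0.0, "")}
    for ch in text:
        nxt = {}
        for ctx, (bits, out) in best.items():
            options = [(bits + model.cost(ctx, ch), out + ch)]
            if out:  # a boundary is never placed before the first symbol
                b = bits + model.cost(ctx, space) + model.cost(ctx + space, ch)
                options.append((b, out + space + ch))
            for b, o in options:
                key = o[-model.order:]
                if key not in nxt or b < nxt[key][0]:
                    nxt[key] = (b, o)
        best = nxt
    return min(best.values())[1]


# Toy pre-segmented training corpus (hypothetical data).
model = CharModel(order=2)
model.train("the cat sat on the mat " * 20)

print(segment(model, "thecatsat"))  # -> the cat sat
```

Because each symbol's cost depends only on the preceding `order` output symbols, keeping the cheapest candidate per context is enough to find the optimal placement under this model. The same code applies unchanged to Chinese characters, given a pre-segmented corpus in which words are separated by spaces.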