Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or on pre-segmented data, but both kinds of resource are labor-intensive to produce, and the lexico-syntactic techniques are vulnerable to the unknown word problem. In contrast, we introduce a novel, more robust statistical method utilizing unsegmented training data. Despite its simplicity, the algorithm yields performance on long kanji sequences comparable to and sometimes surpassing that of state-of-the-art morphological analyzers over a variety of error metrics. The algorithm also outperforms another mostly-unsupervised statistical algorithm previously proposed for Chinese. Additionally, we present a two-level annotation scheme for Japanese to incorporate multiple segmentation granularities, and introduce two novel evaluation metrics, both based on the notion of a compatible bracket, that can account for multiple granularities simultaneously.