Mostly-unsupervised statistical segmentation of Japanese: applications to kanji

Authors:
Rie Kubota Ando; Lillian Lee
Affiliations:
Cornell University, Ithaca, NY; Cornell University, Ithaca, NY
Venue:
NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Year:
2000

Abstract

Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and grammar or on pre-segmented data. In contrast, we introduce a novel statistical method utilizing unsegmented training data, with performance on kanji sequences comparable to and sometimes surpassing that of morphological analyzers over a variety of error metrics.