A stochastic finite-state word-segmentation algorithm for Chinese

  • Authors:
  • Richard Sproat; William Gale; Chilin Shih; Nancy Chang

  • Affiliations:
  • Bell Laboratories; Bell Laboratories; Bell Laboratories; University of Cambridge

  • Venue:
  • Computational Linguistics
  • Year:
  • 1996

Abstract

The initial stage of text analysis for any NLP task usually involves the tokenization of the input into words. For languages like English one can assume, to a first approximation, that word boundaries are given by whitespace or punctuation. In various Asian languages, including Chinese, on the other hand, whitespace is never used to delimit words, so one must resort to lexical information to "reconstruct" the word-boundary information. In this paper we present a stochastic finite-state model in which the basic workhorse is the weighted finite-state transducer. The model segments Chinese text into dictionary entries and words derived by various productive lexical processes, and, since the primary intended application of this model is to text-to-speech synthesis, provides pronunciations for these words. We evaluate the system's performance by comparing its segmentation "judgments" with those of a pool of human segmenters, and the system is shown to perform quite well.
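
The paper's segmenter is implemented with weighted finite-state transducers; as a rough illustration of the underlying idea only, the Python sketch below picks the segmentation of an input string that minimizes total negative log word probability, which corresponds to finding the cheapest path through the lattice a weighted dictionary transducer would produce. The toy lexicon, its frequencies, and the four-character word-length cap are invented for this example and are not taken from the paper.

```python
import math

# Hypothetical toy lexicon mapping words to corpus frequencies. The
# authors' system uses a large Chinese dictionary compiled into a
# weighted finite-state transducer; these entries are invented here
# purely for illustration.
LEXICON = {
    "日": 10, "文": 5, "日文": 40, "文章": 60,
    "章": 8, "魚": 30, "章魚": 50,
}
TOTAL = sum(LEXICON.values())

def word_cost(word: str) -> float:
    # Cost of a word = negative log of its estimated probability,
    # mirroring the arc weights a WFST-based segmenter would carry.
    return -math.log(LEXICON[word] / TOTAL)

def segment(text: str) -> list[str]:
    """Return the least-cost segmentation of `text` into lexicon words.

    The dynamic program below is equivalent to finding the cheapest
    path through the lattice obtained by composing the input string
    with a weighted dictionary transducer.
    """
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: min cost to segment text[:i]
    back = [None] * (n + 1)      # back[i]: start index of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        # Assume lexicon words are at most 4 characters long.
        for j in range(max(0, i - 4), i):
            word = text[j:i]
            if word in LEXICON and best[j] + word_cost(word) < best[i]:
                best[i] = best[j] + word_cost(word)
                back[i] = j
    if n > 0 and back[n] is None:
        raise ValueError("input not covered by the toy lexicon")
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

if __name__ == "__main__":
    # "日文章魚" is ambiguous between 日文/章魚 ("Japanese" + "octopus")
    # and 日/文章/魚; the frequency weights favor the former.
    print(segment("日文章魚"))  # -> ['日文', '章魚']
```

A full system would go beyond a static dictionary, handling words produced by the productive lexical processes the abstract mentions (for example, derived words and names) and attaching pronunciations for text-to-speech synthesis.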