Chinese word segmentation without using lexicon and hand-crafted training data

Authors: Sun Maosong; Shen Dayang; Benjamin K. Tsou
Affiliations: Tsinghua University, Beijing, China; Shantou University, Guangdong, China; City University of Hong Kong, Hong Kong
Venue: COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Year: 1998

Abstract

Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese texts without using any lexicon or hand-crafted linguistic resource. The statistical data required by the algorithm, namely the mutual information and the difference of t-score between characters, is derived automatically from raw Chinese corpora. A preliminary experiment shows that the segmentation accuracy of the algorithm is acceptable. We hope that the insights gained from this approach will help improve the performance of existing segmenters, especially their ability to cope with unknown words and to adapt to various domains, although the algorithm itself can also be used as a stand-alone segmenter in some NLP applications.
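
To make the two statistics named in the abstract concrete, here is a minimal Python sketch of how they might be estimated from raw, unsegmented text. The abstract does not give the formulas, so everything below is an assumption: mutual information is taken in its common pointwise form, the t-score follows a common formulation that compares how strongly a character binds to its right and left neighbours, and the variance approximation, the function names, the toy corpus, and the threshold-based splitting rule are all hypothetical illustrations rather than the paper's actual procedure.

```python
import math
from collections import Counter


def train_counts(lines):
    """Count characters and adjacent character pairs in raw, unsegmented text."""
    uni, bi = Counter(), Counter()
    for line in lines:
        uni.update(line)
        bi.update(zip(line, line[1:]))
    return uni, bi


def mutual_information(x, y, uni, bi):
    """Pointwise MI, log2 p(x,y) / (p(x) p(y)); -inf if the pair was never seen."""
    if bi[(x, y)] == 0:
        return float("-inf")
    p_xy = bi[(x, y)] / sum(bi.values())
    n = sum(uni.values())
    return math.log2(p_xy / ((uni[x] / n) * (uni[y] / n)))


def t_score(x, y, z, uni, bi):
    """Assumed t-score of y in the string xyz.

    Positive: y binds more strongly to z (right); negative: to x (left).
    The variance terms use the rough approximation count / total**2.
    """
    if uni[x] == 0 or uni[y] == 0:
        return 0.0
    p_right = bi[(y, z)] / uni[y]  # estimate of p(z | y)
    p_left = bi[(x, y)] / uni[x]   # estimate of p(y | x)
    var = bi[(y, z)] / uni[y] ** 2 + bi[(x, y)] / uni[x] ** 2
    return (p_right - p_left) / math.sqrt(var) if var > 0 else 0.0


def dts(v, x, y, w, uni, bi):
    """Assumed difference of t-score for the gap between x and y in vxyw.

    Larger values suggest x and y belong together; negative values suggest a break.
    """
    return t_score(v, x, y, uni, bi) - t_score(x, y, w, uni, bi)


def segment(text, uni, bi, mi_threshold):
    """Hypothetical rule: join adjacent characters whose MI reaches the threshold."""
    if not text:
        return []
    words = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if mutual_information(prev, cur, uni, bi) >= mi_threshold:
            words[-1] += cur
        else:
            words.append(cur)
    return words


# Toy corpus, invented for illustration only.
uni, bi = train_counts(["中国人民", "中国银行", "人民银行"])
print(segment("中国人民", uni, bi, mi_threshold=2.5))  # ['中国', '人民']
```

On this toy corpus the pair 中国 has a mutual information of 3 bits while 国人 has only 2, so a threshold of 2.5 places the break between 国 and 人. The `segment` function uses mutual information alone to stay short; how the paper combines it with the difference of t-score is not stated in the abstract and is not reproduced here.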