Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach

Authors:
Jianfeng Gao;Mu Li;Andi Wu;Chang-Ning Huang
Venue:
Computational Linguistics
Year:
2005


Abstract

This article presents a pragmatic approach to Chinese word segmentation. It differs from most previous approaches mainly in three respects. First, while theoretical linguists have defined Chinese words using various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Second, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e., morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard that is application-independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different natural language processing applications might require different granularities of Chinese words.

These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg, which will be described in detail. It consists of two components: (1) a generic segmenter that is based on the framework of linear mixture models and provides a unified approach to the five fundamental features of word-level Chinese language processing: lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and (2) a set of output adaptors for adapting the output of (1) to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets.
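The two-component design described in the abstract can be illustrated with a minimal sketch. It is not the MSRSeg implementation: the toy lexicon, its probabilities, the two feature functions, the weights, and the names `candidates`, `features`, `score`, `segment`, and `adapt` are all assumptions made up for illustration. The sketch only shows the general idea of scoring each candidate segmentation with a weighted linear combination of feature functions, then rewriting the generic output with a separate adaptor step.

```python
import math
from typing import Dict, List

# Hypothetical toy lexicon with made-up unigram probabilities (not from the paper).
LEXICON: Dict[str, float] = {
    "研究": 0.03, "研究生": 0.01, "生命": 0.02, "命": 0.005, "起源": 0.01,
    "北京": 0.02, "大学": 0.02, "北京大学": 0.004,
}
# Assumed floor log-probability for any unit that is not in the lexicon.
UNKNOWN_LOGPROB = math.log(1e-6)
# Assumed mixture weights, one per feature function below.
WEIGHTS = [1.0, -2.0]


def candidates(text: str, max_len: int = 4) -> List[List[str]]:
    """Enumerate segmentations: lexicon words plus single characters."""
    if not text:
        return [[]]
    out: List[List[str]] = []
    for n in range(1, min(max_len, len(text)) + 1):
        head = text[:n]
        # A single character is always allowed, so every string has a candidate.
        if n == 1 or head in LEXICON:
            for rest in candidates(text[n:], max_len):
                out.append([head] + rest)
    return out


def features(seg: List[str]) -> List[float]:
    """Two illustrative features: unigram log-probability and unknown-unit count."""
    log_prob = sum(
        math.log(LEXICON[w]) if w in LEXICON else UNKNOWN_LOGPROB for w in seg
    )
    unknown = float(sum(1 for w in seg if w not in LEXICON))
    return [log_prob, unknown]


def score(seg: List[str]) -> float:
    """Linear mixture: weighted sum of the feature values."""
    return sum(lam * f for lam, f in zip(WEIGHTS, features(seg)))


def segment(text: str) -> List[str]:
    """Generic segmenter: return the highest-scoring candidate."""
    return max(candidates(text), key=score)


def adapt(seg: List[str], split_rules: Dict[str, List[str]]) -> List[str]:
    """Output adaptor: rewrite units to match an application-specific standard."""
    out: List[str] = []
    for w in seg:
        out.extend(split_rules.get(w, [w]))
    return out


if __name__ == "__main__":
    print(segment("研究生命起源"))
    generic = segment("北京大学")
    # A hypothetical standard that prefers finer-grained units.
    print(adapt(generic, {"北京大学": ["北京", "大学"]}))
```

With these made-up numbers the generic step keeps 北京大学 as one unit, and the adaptor splits it for an application that wants finer granularity. Keeping the adaptor separate means the scoring model does not have to be retrained for each standard, which is the design choice the abstract argues for.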