This article presents a pragmatic approach to Chinese word segmentation. It differs from most previous approaches mainly in three respects. First, while theoretical linguists have defined Chinese words using various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Second, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e., morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal word segmentation standard that is application-independent. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different natural language processing applications might require different granularities of Chinese words.

These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg, which will be described in detail. It consists of two components: (1) a generic segmenter that is based on the framework of linear mixture models and provides a unified approach to the five fundamental features of word-level Chinese language processing: lexicon word processing, morphological analysis, factoid detection, named entity recognition, and new word identification; and (2) a set of output adaptors for adapting the output of (1) to different application-specific standards. Evaluation on five test sets with different standards shows that the adaptive system achieves state-of-the-art performance on all the test sets.
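The two-component design can be illustrated with a small Python sketch. It is not the MSRSeg implementation: the feature functions, weights, toy lexicon, class labels ("LEX", "NE"), and the name-splitting adaptor are all hypothetical and chosen only to show the general shape of a linear mixture score followed by an output adaptor.

```python
from typing import Callable, Dict, List, Tuple

# A candidate segmentation: a list of (word, word_class) pairs.
# The class labels "LEX" (lexicon word) and "NE" (named entity) are illustrative.
Segmentation = List[Tuple[str, str]]
FeatureFn = Callable[[Segmentation], float]

# Hypothetical toy lexicon, not taken from the article.
TOY_LEXICON = {"喜欢", "北京", "李", "明"}


def lexicon_feature(cand: Segmentation) -> float:
    """Count the words tagged LEX that appear in the toy lexicon."""
    return float(sum(1 for w, c in cand if c == "LEX" and w in TOY_LEXICON))


def named_entity_feature(cand: Segmentation) -> float:
    """Count the words tagged as named entities."""
    return float(sum(1 for _, c in cand if c == "NE"))


def length_penalty(cand: Segmentation) -> float:
    """Prefer segmentations with fewer words."""
    return -float(len(cand))


FEATURES: Dict[str, FeatureFn] = {
    "lexicon": lexicon_feature,
    "named_entity": named_entity_feature,
    "length_penalty": length_penalty,
}

# Made-up weights for illustration only.
WEIGHTS: Dict[str, float] = {"lexicon": 1.0, "named_entity": 2.5, "length_penalty": 1.0}


def mixture_score(cand: Segmentation) -> float:
    """Linear mixture: weighted sum of the feature function values."""
    return sum(WEIGHTS[name] * fn(cand) for name, fn in FEATURES.items())


def best_segmentation(cands: List[Segmentation]) -> Segmentation:
    """Generic segmenter step: pick the highest-scoring candidate."""
    return max(cands, key=mixture_score)


def split_person_names(seg: Segmentation) -> Segmentation:
    """Hypothetical output adaptor: split a named entity of two or more
    characters into first character + remainder, for an application
    standard that wants family and given names as separate units."""
    out: Segmentation = []
    for word, cls in seg:
        if cls == "NE" and len(word) >= 2:
            out.extend([(word[0], "NE"), (word[1:], "NE")])
        else:
            out.append((word, cls))
    return out


# Two candidate segmentations of the same sentence, 李明喜欢北京.
cand_a: Segmentation = [("李明", "NE"), ("喜欢", "LEX"), ("北京", "LEX")]
cand_b: Segmentation = [("李", "LEX"), ("明", "LEX"), ("喜欢", "LEX"), ("北京", "LEX")]

generic = best_segmentation([cand_a, cand_b])
adapted = split_person_names(generic)
```

In this sketch known-word and unknown-word evidence compete inside one scoring function, so a single search decides both at once, and the adaptor changes granularity afterwards without touching the scorer. A real system would enumerate candidates from the input text and set the weights by training rather than by hand.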