A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics.
Maximum entropy models for natural language ambiguity resolution.
Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, Special Issue on Using Large Corpora: II.
Improving Chinese tokenization with linguistic filters on statistical lexical acquisition. ANLC '94: Proceedings of the Fourth Conference on Applied Natural Language Processing.
A trainable rule-based algorithm for word segmentation. ACL '98: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics.
Is it harder to parse Chinese, or the Chinese Treebank? ACL '03: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1.
Two statistical parsing models applied to the Chinese Treebank. CLPW '00: Proceedings of the Second Workshop on Chinese Language Processing, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, Volume 12.
Multidimensional transformation-based learning. CoNLL '01: Proceedings of the 2001 Workshop on Computational Natural Language Learning, Volume 7.
A maximum entropy Chinese character-based parser. EMNLP '03: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing.
A statistical parser for Chinese. HLT '02: Proceedings of the Second International Conference on Human Language Technology Research.
A fast, accurate deterministic parser for Chinese. ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.
TBL-improved non-deterministic segmentation and POS tagging for a Chinese parser. EACL '09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics.
Semantic roles for SMT: a hybrid two-pass model. NAACL-Short '09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers.
Parsing the internal structure of words: a new paradigm for Chinese word segmentation. HLT '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1.
A machine learning parser using an unlexicalized distituent model. CICLing '10: Proceedings of the 11th International Conference on Computational Linguistics and Intelligent Text Processing.
Parsing, the task of identifying syntactic constituents such as noun and verb phrases in a sentence, is one of the fundamental problems in natural language processing. Many applications, including spoken-language understanding, machine translation, and information extraction, benefit from, or even require, high-accuracy parsing as a preprocessing step. Although most state-of-the-art statistical parsers were originally built for English, they are generally not language-specific, in that they do not rely on properties unique to English. Constructing a parser for another language therefore largely reduces to retraining the statistical parameters on a treebank in that language. The development of the Chinese Treebank [Xia et al. 2000] spurred the construction of parsers for Chinese. However, Chinese poses some unique problems for statistical parsing, the most apparent being word segmentation. Because words in written Chinese are not delimited by whitespace as in Western languages, word boundaries must be identified before an existing statistical method can be applied. Most pre-existing Chinese parsers neglect this step and assume that the input has already been pre-segmented. This article describes a character-based statistical parser that gives the best performance to date on the Chinese Treebank data. We augment an existing maximum entropy parser with transformation-based learning, creating a parser that can operate at the character level. We present experiments showing that our parser achieves results close to those achievable under perfect word-segmentation conditions.
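To make the segmentation problem concrete, the sketch below implements a simple greedy forward maximum-matching baseline over a toy lexicon. This is only an illustration of why segmentation is a distinct, ambiguity-prone preprocessing step; it is not the character-based maximum entropy model described in the article, and the lexicon entries and function name are hypothetical.

```python
# Illustrative baseline only: greedy forward maximum matching against a
# toy lexicon (hypothetical entries), NOT the paper's character-based model.

TOY_LEXICON = {"北京", "大学", "北京大学", "生", "学生"}

def max_match(text, lexicon, max_len=4):
    """At each position, greedily take the longest lexicon match;
    fall back to a single character when no entry matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:  # no multi-character match found
            words.append(text[i])
            i += 1
    return words

# 北京大学生 is ambiguous: 北京大学 / 生 ("Peking University student")
# vs. 北京 / 大学生 ("Beijing" + "university student"); the greedy
# matcher commits to the first reading without weighing alternatives.
print(max_match("北京大学生", TOY_LEXICON))  # ['北京大学', '生']
```

Because such dictionary-driven heuristics cannot resolve ambiguities like the one above in context, statistical approaches, such as the character-level model this article describes, treat segmentation decisions as part of the model rather than a fixed preprocessing rule.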