Chinese sentence segmentation as comma classification

Authors:
Nianwen Xue; Yaqin Yang
Affiliations:
Brandeis University, Waltham, MA; Brandeis University, Waltham, MA
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Year:
2011

Abstract

We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.
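
The framing in the abstract can be illustrated with a minimal sketch, assuming each comma receives one of two labels and that sentences are split only at commas labeled as boundaries. The label names `EOS` and `NON_EOS`, the helper functions, and the toy labels below are all hypothetical; they are not taken from the paper and do not reproduce its model or its results. The sketch only shows why overall accuracy can sit well above the F1 score for the boundary class when boundary commas are the minority.

```python
from typing import List

# Hypothetical label names: EOS marks a comma that ends a sentence-like unit.
EOS = "EOS"
NON_EOS = "NON_EOS"


def segment(clauses: List[str], comma_labels: List[str]) -> List[str]:
    """Rebuild sentences from comma-separated clauses.

    comma_labels[i] is the label of the comma between clauses[i] and clauses[i + 1].
    """
    if len(comma_labels) != len(clauses) - 1:
        raise ValueError("need exactly one label per comma")
    sentences = [clauses[0]]
    for clause, label in zip(clauses[1:], comma_labels):
        if label == EOS:
            sentences.append(clause)           # boundary comma: start a new sentence
        else:
            sentences[-1] += "，" + clause      # ordinary comma: keep the clause attached
    return sentences


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of commas whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold: List[str], pred: List[str], label: str = EOS) -> float:
    """F1 score computed for one label only (here, boundary commas)."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy gold and predicted labels for ten commas (made up for illustration).
gold = [EOS, NON_EOS, NON_EOS, NON_EOS, EOS, NON_EOS, NON_EOS, NON_EOS, NON_EOS, EOS]
pred = [EOS, NON_EOS, NON_EOS, NON_EOS, NON_EOS, NON_EOS, NON_EOS, EOS, NON_EOS, EOS]

print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"F1 (EOS) = {f1_for_label(gold, pred):.2f}")
print(segment(["a", "b", "c"], [NON_EOS, EOS]))
```

On this toy data, eight of ten commas are labeled correctly, while precision and recall for the boundary class are each two out of three, so the boundary F1 is lower than the overall accuracy. The same effect is consistent with the gap between the two figures reported in the abstract, though the actual class distribution in the paper's data is not stated here.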