Statistical machine learning algorithms have been successfully applied to many natural language processing (NLP) problems. Compared to manually constructed systems, statistical NLP systems are often easier to develop and maintain, since only annotated training text is required. From annotated data, the underlying statistical algorithm can build a model so that annotations for future data can be predicted. However, the performance of a statistical system can also depend heavily on the characteristics of the training data. If we apply such a system to text whose characteristics differ from those of the training data, performance degrades. In this paper, we examine this issue empirically using the sentence boundary detection problem. We propose and compare several methods for updating a statistical NLP system when moving to a different domain.
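The abstract frames sentence boundary detection as a supervised classification task: for each period, features of the surrounding tokens predict whether it ends a sentence, and a model trained on annotated in-domain data may misfire on out-of-domain text. A minimal sketch of that workflow, using an add-one-smoothed Naive Bayes classifier with invented toy features and data (the feature set, labels, and examples here are illustrative assumptions, not the paper's actual method):

```python
import math
from collections import defaultdict

def features(prev_tok, next_tok):
    """Toy features for a period between prev_tok and next_tok."""
    return (
        ("prev_short", len(prev_tok) <= 3),           # abbreviations are often short
        ("prev_capitalized", prev_tok[:1].isupper()), # e.g. "Dr", "Mr"
        ("next_capitalized", next_tok[:1].isupper()), # sentences start capitalized
    )

class NaiveBayes:
    def __init__(self):
        self.label_counts = defaultdict(int)
        self.feat_counts = defaultdict(int)

    def train(self, examples):
        for prev_tok, next_tok, label in examples:
            self.label_counts[label] += 1
            for f in features(prev_tok, next_tok):
                self.feat_counts[(label, f)] += 1

    def predict(self, prev_tok, next_tok):
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            for f in features(prev_tok, next_tok):
                # add-one smoothed log-likelihood of each feature value
                score += math.log((self.feat_counts[(label, f)] + 1) / (n + 2))
            if score > best_score:
                best, best_score = label, score
        return best

# Hand-made annotated examples (illustrative only):
# (token before ".", token after ".", gold label)
train_data = [
    ("Dr", "Smith", "no-boundary"),
    ("Mr", "Jones", "no-boundary"),
    ("etc", "and", "no-boundary"),
    ("home", "The", "boundary"),
    ("today", "We", "boundary"),
    ("cat", "It", "boundary"),
]

clf = NaiveBayes()
clf.train(train_data)
print(clf.predict("Inc", "shares"))  # abbreviation-like context
print(clf.predict("park", "She"))    # end-of-sentence-like context
```

The sketch also illustrates the paper's motivating problem: a classifier trained on, say, news text learns feature statistics (abbreviation frequencies, capitalization conventions) that may not transfer to a domain such as biomedical abstracts, which is why the model's counts would need updating with target-domain examples.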