Sentence boundary detection and the problem with the U.S.

Authors:
Dan Gillick
Affiliations:
University of California, Berkeley
Venue:
NAACL-Short '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers
Year:
2009

Abstract

Sentence boundary detection is widely used, but often with outdated tools. We discuss what makes the task difficult and which features are relevant, and we present a fully statistical system, now publicly available, that gives the best known error rate on a standard news corpus: of some 27,000 examples, our system makes 67 errors, 23 of them involving the word "U.S."
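
To make the difficulty concrete, the sketch below shows why a period after "U.S." is ambiguous and works out the error rate implied by the figures in the abstract. It is an illustration only, not the system described in the paper: `naive_split` and `error_rate` are hypothetical helper names, the example sentence is invented, and the rule used (split after any ".", "?" or "!" followed by whitespace) is a deliberately simplistic baseline.

```python
import re


def naive_split(text: str) -> list[str]:
    """Hypothetical baseline: split after '.', '?' or '!' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def error_rate(errors: int, examples: int) -> float:
    """Fraction of examples classified incorrectly."""
    return errors / examples


# Invented example: the first "U.S." is sentence-internal, the second ends a sentence.
text = "He moved to the U.S. in 1990. She stayed in the U.S. It was a hard choice."
segments = naive_split(text)

for segment in segments:
    print(segment)

# Figures taken from the abstract; 27,000 is approximate ("some 27,000").
print(f"overall error rate: {error_rate(67, 27000):.2%}")
print(f"share of errors involving 'U.S.': {error_rate(23, 67):.0%}")
```

The baseline produces four segments for three sentences, because it cannot tell that the first "U.S." is followed by a continuation of the same sentence. A rule that never splits after "U.S." would instead miss the second boundary, which is why the decision has to depend on context. With the abstract's figures, 67 errors in roughly 27,000 examples is an error rate of about 0.25%, and about a third of those errors involve "U.S."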