Sentence boundary detection and the problem with the U.S.

  • Authors:
  • Dan Gillick

  • Affiliations:
  • University of California, Berkeley

  • Venue:
  • NAACL-Short '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

Sentence Boundary Detection is widely used but often with outdated tools. We discuss what makes it difficult, which features are relevant, and present a fully statistical system, now publicly available, that gives the best known error rate on a standard news corpus: Of some 27,000 examples, our system makes 67 errors, 23 involving the word "U.S."