Adaptive sentence boundary disambiguation

  • Authors: David D. Palmer; Marti A. Hearst

  • Affiliations: University of California, Berkeley, Berkeley, CA; Xerox PARC, Palo Alto, CA

  • Venue: ANLC '94 Proceedings of the fourth conference on Applied natural language processing
  • Year: 1994

Abstract

Labeling of sentence boundaries is a prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate them, most systems use brittle, special-purpose regular expression grammars and exception rules. As an alternative, we have developed an efficient, trainable algorithm that uses a lexicon with part-of-speech probabilities and a feed-forward neural network. This work demonstrates the feasibility of using prior probabilities of part-of-speech assignments, as opposed to words or definite part-of-speech assignments, as contextual information. After training for less than one minute, the method correctly labels over 98.5% of sentence boundaries in a corpus of over 27,000 sentence-boundary marks. We show the method to be efficient and easily adaptable to different text genres, including single-case texts.
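
As a rough illustration of the idea in the abstract, the sketch below represents the context around a punctuation mark by the prior part-of-speech probabilities of neighboring tokens and passes that vector through a small feed-forward network. It is a minimal sketch, not the authors' implementation: the category set, the toy lexicon, the window size `k`, the hidden-layer size, the absence of bias terms, and all function names are assumptions made for this example, and the weights are random rather than trained, so the score carries no meaning beyond showing the data flow.

```python
import math
import random

# Hypothetical coarse part-of-speech categories (not the paper's category set).
CATEGORIES = ["noun", "verb", "proper_noun", "abbreviation", "punctuation", "other"]

# Hypothetical toy lexicon: word -> prior probability of each category.
LEXICON = {
    "dr.": {"abbreviation": 1.0},
    "smith": {"proper_noun": 0.9, "noun": 0.1},
    "left": {"verb": 0.7, "other": 0.3},
    "he": {"other": 1.0},
    "returned": {"verb": 1.0},
    ".": {"punctuation": 1.0},
}


def priors(token):
    """Return the prior vector for one token; unknown words get a uniform guess."""
    entry = LEXICON.get(token.lower())
    if entry is None:
        return [1.0 / len(CATEGORIES)] * len(CATEGORIES)
    return [entry.get(c, 0.0) for c in CATEGORIES]


def context_vector(tokens, i, k=2):
    """Concatenate priors for the k tokens before and k tokens after position i."""
    vec = []
    for j in list(range(i - k, i)) + list(range(i + 1, i + k + 1)):
        if 0 <= j < len(tokens):
            vec.extend(priors(tokens[j]))
        else:
            vec.extend([0.0] * len(CATEGORIES))  # padding outside the text
    return vec


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """One hidden layer of sigmoid units feeding a single sigmoid output."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))


# Untrained, randomly initialized weights (assumption: 4 hidden units, no biases).
rng = random.Random(0)
n_inputs = 2 * 2 * len(CATEGORIES)  # 2 tokens on each side of the mark
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(4)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(4)]

tokens = ["Dr.", "Smith", "left", ".", "He", "returned", "."]
x = context_vector(tokens, 3)          # context of the period after "left"
score = forward(x, w_hidden, w_out)    # value in (0, 1); meaningless until trained
```

In a trained system, a score above some threshold would be read as "this mark ends a sentence". Because the input consists of category priors rather than word identities, the same network shape could in principle be reused with a different lexicon, which is the kind of adaptability the abstract points to.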