Adaptive sentence boundary disambiguation

Authors:
David D. Palmer; Marti A. Hearst
Affiliations:
University of California, Berkeley, Berkeley, CA; Xerox PARC, Palo Alto, CA
Venue:
ANLC '94 Proceedings of the fourth conference on Applied natural language processing
Year:
1994

Citing 6

Introduction to the theory of neural computation

A program for aligning sentences in bilingual corpora
Computational Linguistics - Special issue on using large corpora: I

Text-translation alignment
Computational Linguistics - Special issue on using large corpora: I

A stochastic parts program and noun phrase parser for unrestricted text
ANLC '88 Proceedings of the second conference on Applied natural language processing

A practical part-of-speech tagger
ANLC '92 Proceedings of the third conference on Applied natural language processing

Some applications of tree-based modelling to speech and language
HLT '89 Proceedings of the workshop on Speech and Natural Language

Cited 20

Building a scalable and accurate copy detection mechanism
Proceedings of the first ACM international conference on Digital libraries

Periods, capitalized words, etc.
Computational Linguistics

Automatic Structuring of Written Texts
TSD '99 Proceedings of the Second International Workshop on Text, Speech and Dialogue

Mining free text for structure
Data mining

Adaptive multilingual sentence boundary disambiguation
Computational Linguistics

Experiments on sentence boundary detection
ANLC '00 Proceedings of the sixth conference on Applied natural language processing

Advances in domain independent linear text segmentation
NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference

High performance segmentation of spontaneous speech using part of speech and trigger word information
ANLC '97 Proceedings of the fifth conference on Applied natural language processing

A maximum entropy approach to identifying sentence boundaries
ANLC '97 Proceedings of the fifth conference on Applied natural language processing

Regular expressions for language engineering
Natural Language Engineering

Comma restoration using constituency information
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Improving translation quality of rule-based machine translation
COLING-MTIA '02 Proceedings of the 2002 COLING workshop on Machine translation in Asia - Volume 16

Tagging Sentence Boundaries in Biomedical Literature
CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing

Constructing lexicon with morpho-syntactic features from untagged corpora
ECC'09 Proceedings of the 3rd international conference on European computing conference

Chinese utterance segmentation in spoken language translation
CICLing'03 Proceedings of the 4th international conference on Computational linguistics and intelligent text processing

Using support vector machines for terrorism information extraction
ISI'03 Proceedings of the 1st NSF/NIJ conference on Intelligence and security informatics

Detecting sentence boundaries in Japanese speech transcriptions using a morphological analyzer
IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing

A case study of using web search statistics: case restoration
CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing

A Chinese sentence segmentation approach based on comma
CLSW'12 Proceedings of the 13th Chinese conference on Chinese Lexical Semantics

Relevant learning objects extraction based on semantic annotation
International Journal of Metadata, Semantics and Ontologies

Abstract

Labeling of sentence boundaries is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate them most systems use brittle, special-purpose regular expression grammars and exception rules. As an alternative, we have developed an efficient, trainable algorithm that uses a lexicon with part-of-speech probabilities and a feed-forward neural network. This work demonstrates the feasibility of using prior probabilities of part-of-speech assignments, as opposed to words or definite part-of-speech assignments, as contextual information. After training for less than one minute, the method correctly labels over 98.5% of sentence boundaries in a corpus of over 27,000 sentence-boundary marks. We show the method to be efficient and easily adaptable to different text genres, including single-case texts.
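The central idea of the abstract, describing each context word by its prior part-of-speech probabilities and passing the resulting vector to a feed-forward network, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the tag set `TAGS`, the toy `LEXICON` and its probabilities, the capitalization flag, the window size, and the layer sizes. The weights are random and untrained, so the printed score carries no meaning; the sketch only shows how such an input could be assembled and scored.

```python
import math
import random

# Hypothetical tag set and lexicon of part-of-speech priors (illustrative values only).
TAGS = ["noun", "verb", "abbreviation", "proper_noun", "other"]
LEXICON = {
    "dr": {"abbreviation": 0.9, "noun": 0.1},
    "smith": {"proper_noun": 1.0},
    "arrived": {"verb": 1.0},
    "he": {"other": 1.0},
    "left": {"verb": 0.7, "other": 0.3},
}


def descriptor(token):
    """Return the token's POS prior probabilities plus a capitalization flag."""
    priors = LEXICON.get(token.lower())
    if priors is None:
        # Assumption: words missing from the lexicon get a uniform distribution.
        vec = [1.0 / len(TAGS)] * len(TAGS)
    else:
        vec = [priors.get(tag, 0.0) for tag in TAGS]
    vec.append(1.0 if token[:1].isupper() else 0.0)
    return vec


def context_vector(tokens, index, window=2):
    """Concatenate descriptors of `window` tokens on each side of tokens[index]."""
    features = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # the punctuation mark itself is not described
        pos = index + offset
        if 0 <= pos < len(tokens):
            features.extend(descriptor(tokens[pos]))
        else:
            # Pad with zeros where the window runs past the text.
            features.extend([0.0] * (len(TAGS) + 1))
    return features


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(features, w_hidden, w_out):
    """One hidden layer of sigmoid units followed by a single sigmoid output."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, features))) for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))


tokens = ["Dr", ".", "Smith", "arrived", ".", "He", "left", "."]
features = context_vector(tokens, 1)  # context of the period after "Dr"

rng = random.Random(0)
n_hidden = 4  # assumed hidden-layer size
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in features] for _ in range(n_hidden)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

score = forward(features, w_hidden, w_out)
print(f"untrained boundary score: {score:.3f}")
```

In a trained system of this kind, the output would be compared with a threshold to decide whether the punctuation mark ends a sentence. Because the input encodes probability distributions over tags and not word identities, the input size stays fixed regardless of vocabulary.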