Adaptive multilingual sentence boundary disambiguation

Authors: David D. Palmer; Marti A. Hearst
Affiliations: The MITRE Corporation; Xerox PARC
Venue: Computational Linguistics
Year: 1997

Abstract

The sentence is a standard textual unit in natural language processing applications. In many languages the punctuation mark that indicates the end-of-sentence boundary is ambiguous; thus the tokenizers of most NLP systems must be equipped with special sentence boundary recognition rules for every new text collection. As an alternative, this article presents an efficient, trainable system for sentence boundary disambiguation. The system, called Satz, makes simple estimates of the parts of speech of the tokens immediately preceding and following each punctuation mark, and uses these estimates as input to a machine learning algorithm that then classifies the punctuation mark. Satz is very fast in both training and sentence analysis, and its combined robustness and accuracy surpass existing techniques. The system needs only a small lexicon and training corpus, and has been shown to transfer quickly and easily from English to other languages, as demonstrated on French and German.
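The pipeline described in the abstract (part-of-speech estimates for the tokens around each punctuation mark, fed to a trainable classifier) can be illustrated with a minimal Python sketch. This is not the Satz implementation: the toy `LEXICON`, the tag set, the one-token context window, the binary feature encoding, and the perceptron used as the learner are all assumptions chosen to keep the example small, since the abstract only specifies "a machine learning algorithm".

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon: lowercase word -> possible part-of-speech tags.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"}, "dr": {"ABBR"}, "mr": {"ABBR"}, "smith": {"NOUN"},
    "arrived": {"VERB"}, "left": {"VERB", "ADJ"}, "he": {"PRON"},
    "early": {"ADV", "ADJ"},
}
TAGS = ["DET", "NOUN", "VERB", "ADJ", "ADV", "PRON", "ABBR", "UNK"]
PUNCT = {".", "!", "?"}


def tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[.!?]", text)


def descriptor(token: str) -> List[int]:
    """Binary indicator per possible tag, plus a capitalization flag."""
    tags = LEXICON.get(token.lower(), {"UNK"})
    return [int(t in tags) for t in TAGS] + [int(token[:1].isupper())]


def context_features(tokens: List[str], i: int, k: int = 1) -> List[int]:
    """Concatenate descriptors of the k tokens before and after position i."""
    feats: List[int] = []
    for j in list(range(i - k, i)) + list(range(i + 1, i + k + 1)):
        if 0 <= j < len(tokens):
            feats += descriptor(tokens[j])
        else:
            feats += [0] * (len(TAGS) + 1)  # padding at the text edges
    return feats


def candidates(tokens: List[str]) -> List[int]:
    """Indices of tokens that might mark a sentence boundary."""
    return [i for i, t in enumerate(tokens) if t in PUNCT]


def train_perceptron(examples: List[Tuple[List[int], int]],
                     epochs: int = 20) -> Tuple[List[float], float]:
    """Stand-in learner: a plain perceptron over the context features."""
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            pred = int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
            if pred != y:
                delta = y - pred
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                b += delta
    return w, b


def classify(text: str, w: List[float], b: float) -> List[bool]:
    """For each candidate mark in text, True if it is judged a boundary."""
    tokens = tokenize(text)
    return [
        sum(wi * xi for wi, xi in zip(w, context_features(tokens, i))) + b > 0
        for i in candidates(tokens)
    ]


# Tiny hand-labelled training text: 1 = sentence boundary, 0 = not.
train_tokens = tokenize("Dr. Smith arrived. He left early.")
labels = [0, 1, 1]
examples = [(context_features(train_tokens, i), y)
            for i, y in zip(candidates(train_tokens), labels)]
w, b = train_perceptron(examples)
predictions = classify("Mr. Smith left. He arrived.", w, b)
```

On this toy data the period after "Mr" is rejected as a boundary because its context (an abbreviation followed by a capitalized noun) matches the negative training example. A realistic system would need a larger lexicon, a wider context window, and a stronger learner.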