Periods, capitalized words, etc.

Authors:
Andrei Mikheev
Affiliations:
Institute for Communicating and Collaborative Systems, Division of Informatics, 2 Buccleuch Place, Edinburgh EH8 9LW, UK
Venue:
Computational Linguistics
Year:
2002

Abstract

In this article we present an approach for tackling three important aspects of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected, and identification of abbreviations. As opposed to the two dominant techniques of computing statistics or writing specialized grammars, our document-centered approach works by considering suggestive local contexts and repetitions of individual words within a document. This approach proved to be robust to domain shifts and new lexica, and its performance was on a par with the highest reported results. When incorporated into a part-of-speech tagger, it significantly reduced the error rate on capitalized words and sentence boundaries. We also investigated the portability of the approach to other languages and obtained encouraging results.
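
The document-centered idea in the abstract can be illustrated with a small sketch. It is not the article's actual procedure, only a simplified heuristic in the same spirit: a capitalized word at the start of a sentence is ambiguous, so the sketch looks for unambiguous occurrences of the same word elsewhere in the document. The function name `classify_ambiguous_capitals`, the labels, and the naive tokenizer (which treats every period as a sentence end and therefore ignores abbreviations) are assumptions made for illustration.

```python
import re


def classify_ambiguous_capitals(text):
    """Label sentence-initial capitalized words using evidence from the same document.

    Hypothetical sketch: every '.', '!' or '?' is treated as a sentence end,
    which is exactly the simplification a real system cannot afford.
    """
    tokens = re.findall(r"[A-Za-z]+|[.!?]", text)
    seen_lower = set()     # words observed in lowercase anywhere
    seen_cap_mid = set()   # words observed capitalized in mid-sentence
    ambiguous = []         # capitalized words in sentence-initial position
    sentence_start = True

    for tok in tokens:
        if tok in {".", "!", "?"}:
            sentence_start = True
            continue
        if tok[0].isupper():
            if sentence_start:
                ambiguous.append(tok)
            else:
                seen_cap_mid.add(tok.lower())
        else:
            seen_lower.add(tok.lower())
        sentence_start = False

    labels = {}
    for tok in ambiguous:
        key = tok.lower()
        if key in seen_cap_mid and key not in seen_lower:
            labels[tok] = "proper"    # only ever capitalized elsewhere
        elif key in seen_lower and key not in seen_cap_mid:
            labels[tok] = "common"    # appears in lowercase elsewhere
        else:
            labels[tok] = "unknown"   # no evidence, or conflicting evidence
    return labels


sample = (
    "Brown fields were wet. The farmer met Brown yesterday. "
    "Fields of wheat surround the fields."
)
print(classify_ambiguous_capitals(sample))
```

In this sample, "Brown" also occurs capitalized in mid-sentence and never in lowercase, so it is labeled as a proper name, while "The" and "Fields" occur in lowercase elsewhere and are labeled as common words. A word with no other occurrence in the document stays "unknown", which is where the local-context evidence mentioned in the abstract would have to take over.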