Document centered approach to text normalization

Authors:
Andrei Mikheev
Affiliations:
LTG, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, UK
Venue:
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2000

Abstract

In this paper we present an approach to three important problems of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected (for example, at the start of a sentence), and identification of abbreviations. The main feature of our approach is that it uses a minimum of pre-built resources and instead infers disambiguation clues dynamically from the entire document. This makes it domain-independent, closely targeted to each individual document, and portable to other languages. We evaluated the approach thoroughly on several corpora, where it showed high accuracy.
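
As a rough illustration of the document-centered idea for capitalized words, the sketch below gathers casing evidence from unambiguous (mid-sentence) positions in a single document and uses it to classify a word seen in an ambiguous position. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: the function names `collect_evidence` and `classify_capitalized`, the simple majority rule, and the naive treatment of every `.`, `!` or `?` as a sentence boundary are all hypothetical simplifications introduced here.

```python
import re
from collections import Counter

# Assumption: a token is a run of ASCII letters or one sentence-final mark.
TOKEN_RE = re.compile(r"[A-Za-z]+|[.!?]")


def collect_evidence(text):
    """Count how each word is cased in mid-sentence positions of one document.

    Mid-sentence positions are treated as unambiguous: capitalization there
    is not forced by position. Sentence-initial tokens are skipped.
    """
    lower_seen, upper_seen = Counter(), Counter()
    sentence_start = True
    for tok in TOKEN_RE.findall(text):
        if tok in ".!?":
            # Naive assumption: every such mark ends a sentence.
            sentence_start = True
            continue
        if not sentence_start:
            if tok[0].isupper():
                upper_seen[tok.lower()] += 1
            else:
                lower_seen[tok.lower()] += 1
        sentence_start = False
    return lower_seen, upper_seen


def classify_capitalized(word, lower_seen, upper_seen):
    """Classify a capitalized word from an ambiguous position.

    Hypothetical majority rule over the document's own evidence.
    """
    key = word.lower()
    if upper_seen[key] > lower_seen[key]:
        return "proper"
    if lower_seen[key] > upper_seen[key]:
        return "common"
    return "unknown"  # the document offers no unambiguous evidence


if __name__ == "__main__":
    doc = (
        "Brown was late. The meeting with Brown began at noon. "
        "Late arrivals upset the chair. Nobody likes being late."
    )
    lower_seen, upper_seen = collect_evidence(doc)
    for w in ("Brown", "Late", "Nobody"):
        print(w, classify_capitalized(w, lower_seen, upper_seen))
```

The only resource the sketch consults is the document being processed, which is the property the abstract emphasizes. A word with no unambiguous occurrence elsewhere in the document stays "unknown" here, and would need some other strategy.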