In this article we present an approach to three important aspects of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected, and identification of abbreviations. In contrast to the two dominant techniques, computing statistics or writing specialized grammars, our document-centered approach considers suggestive local contexts and repetitions of individual words within a document. The approach proved robust to domain shifts and new lexica, and its performance was on a par with the highest reported results. When incorporated into a part-of-speech tagger, it significantly reduced the error rate on capitalized words and sentence boundaries. We also investigated its portability to other languages and obtained encouraging results.
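The document-centered idea for capitalized words can be illustrated with a minimal Python sketch. It is not the system described in the article: the function name `classify_capitalized`, the simple regular-expression tokenizer, and the majority-vote rule are all assumptions made for illustration. The sketch resolves a word that is capitalized in an ambiguous position (here, directly after sentence-final punctuation) by looking at how the same word is written in unambiguous, mid-sentence positions elsewhere in the same document.

```python
import re
from collections import Counter

# Hypothetical tokenizer: alphabetic words and sentence-final punctuation only.
TOKEN = re.compile(r"[A-Za-z]+|[.!?]")


def classify_capitalized(text):
    """Label words capitalized after a sentence break as proper, common, or unknown.

    Evidence comes only from the same document: occurrences of the word in
    mid-sentence positions, where capitalization is not forced by position.
    """
    lower_mid = Counter()  # mid-sentence occurrences written in lower case
    upper_mid = Counter()  # mid-sentence occurrences written capitalized
    ambiguous = set()      # words seen capitalized right after a break

    prev = "."  # treat the document start like a sentence break
    for tok in TOKEN.findall(text):
        if tok in ".!?":
            prev = tok
            continue
        key = tok.lower()
        if prev in ".!?":
            # Position where capitalization is expected: no evidence here.
            if tok[0].isupper():
                ambiguous.add(key)
        elif tok[0].isupper():
            upper_mid[key] += 1
        else:
            lower_mid[key] += 1
        prev = tok

    labels = {}
    for word in ambiguous:
        # Assumed decision rule: simple majority of unambiguous occurrences.
        if upper_mid[word] > lower_mid[word]:
            labels[word] = "proper"
        elif lower_mid[word] > upper_mid[word]:
            labels[word] = "common"
        else:
            labels[word] = "unknown"  # no (or tied) evidence in this document
    return labels


if __name__ == "__main__":
    sample = ("Brown went home. The dog saw Brown there. "
              "Dogs bark and the dog runs. Rivers flood.")
    print(classify_capitalized(sample))
```

In the sample text, "Brown" recurs capitalized in mid-sentence position and is labeled a proper name, "The" recurs in lower case and is labeled a common word, and "Dogs" and "Rivers" remain unknown because the document offers no unambiguous occurrence. A fuller treatment would also need the abbreviation and sentence-boundary handling that the abstract mentions, which this sketch leaves out.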