Using Compression to Identify Acronyms in Text

Authors:
Stuart Yeates;David Bainbridge;Ian H. Witten
Venue:
DCC '00 Proceedings of the Conference on Data Compression
Year:
2000

Abstract

Finding acronyms and their definitions in free text is useful for many purposes. Previous acronym definition finders relied heavily on heuristic methods. In contrast, we have developed a new method that uses several PPM models to encode the acronym in terms of its definition. Four different attributes of each acronym are encoded using a PPMD order 5 model: (a) whether the acronym occurred before or after its definition (direction); (b) the distance between the acronym and the definition (first-word offset); (c) the pattern of words in the definition that contribute letters to the acronym (subsequent-word offsets); and (d) the number of letters taken from each of those words.

These models, taken together, give a complete encoding of the acronym in terms of its definition. The models were trained on 1080 acronyms extracted from 150 documents. A model of plain text was trained using 100 independent documents from the same collection. Each word of the testing data was considered a candidate acronym if it could be constructed from the initial letters of the 16 words on either side. If the word could be composed in more than one way, each possibility was considered. The entropy of the four attributes is calculated, as is the entropy of the word as normal text. The ratio of the two is compared to a threshold to determine whether to declare the word an acronym.

The new method is shown to outperform existing heuristic methods for acronym extraction. Figure 1 shows the results of separate experiments performed with acronyms of two or more letters and acronyms of three or more letters. The difference between the two recall-precision curves can be explained by the higher probability that a given two-letter non-acronym word can be formed from the words around it.
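
The candidate-generation and thresholding steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `max_letters` cap on letters taken per word, and the direction of the threshold comparison are all hypothetical, and the PPM models themselves are not implemented (the caller is assumed to supply the two bit costs).

```python
from typing import Iterator, List, Tuple

WINDOW = 16  # words examined on either side of the candidate, per the abstract

Alignment = List[Tuple[int, int]]  # (index of word in context, letters taken)


def alignments(acronym: str, words: List[str], max_letters: int = 3,
               pos: int = 0, start: int = 0) -> Iterator[Alignment]:
    """Yield every way to spell `acronym` from leading letters of `words`,
    using the words in order. `max_letters` is an assumed cap."""
    if pos == len(acronym):
        yield []
        return
    for j in range(start, len(words)):
        for n in range(1, max_letters + 1):
            piece = acronym[pos:pos + n]
            if len(piece) < n or len(words[j]) < n:
                break
            if words[j][:n].lower() != piece.lower():
                break  # a longer prefix cannot match once a shorter one fails
            for rest in alignments(acronym, words, max_letters, pos + n, j + 1):
                yield [(j, n)] + rest


def candidates(tokens: List[str], i: int, window: int = WINDOW,
               max_letters: int = 3) -> Iterator[Tuple[str, Alignment]]:
    """Yield (direction, alignment) for each way tokens[i] can be built from
    the words before or after it. Word indices are relative to the context."""
    before = tokens[max(0, i - window):i]
    after = tokens[i + 1:i + 1 + window]
    for direction, context in (("definition-before", before),
                               ("definition-after", after)):
        for alignment in alignments(tokens[i], context, max_letters):
            yield direction, alignment


def is_acronym(bits_via_definition: float, bits_as_text: float,
               threshold: float) -> bool:
    """Compare the ratio of the two costs to a threshold. Which cost is the
    numerator, and the use of <=, are assumptions made for this sketch."""
    return bits_via_definition / bits_as_text <= threshold
```

Each alignment returned by `candidates` carries the information the four attributes describe: the direction, the index of the first contributing word, the gaps between subsequent contributing words, and the number of letters taken from each. In a fuller system, each alignment would be costed by the trained models and the cheapest one passed to `is_acronym`.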