Using Compression to Identify Acronyms in Text
DCC '00 Proceedings of the Conference on Data Compression
Finding acronyms and their definitions in free text is useful for many purposes. Previous acronym definition finders relied heavily on heuristic methods. In contrast, we have developed a new method that uses several PPM models to encode the acronym in terms of its definition. Four different attributes of each acronym are encoded using a PPMD order 5 model: (a) whether the acronym occurred before or after its definition (direction); (b) the distance between the acronym and the definition (first-word offset); (c) the pattern of words in the definition that contribute letters to the acronym (subsequent-word offsets); and (d) the number of letters taken from each of those words. These models, taken together, give a complete encoding of the acronym in terms of its definition. The models were trained on 1080 acronyms extracted from 150 documents. A model of plain text was trained using 100 independent documents from the same collection. Each word of the testing data was considered a candidate acronym if it could be constructed from the initial letters of the 16 words on either side. If the word could be composed in more than one way, each possibility was considered. The entropy of the four attributes is calculated, as is the entropy of the word as normal text. The ratio of the two is compared to a threshold to determine whether to declare the word an acronym. The new method is shown to outperform existing heuristic methods for acronym extraction. Figure 1 shows the results of separate experiments performed with acronyms of two or more letters and acronyms of three or more letters. The difference between the two recall-precision curves can be explained by the higher probability that a given two-letter non-acronym word can be formed from the words around it.
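The candidate-generation and thresholding steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `candidates`, `_match`, and `is_acronym` are hypothetical, the PPM models themselves are not reproduced (the two entropies are simply passed in as bit counts), and the sketch assumes that definition words are used in order, that words may be skipped, and that a lower attribute-to-text ratio indicates an acronym. The abstract does not state the direction of the comparison.

```python
from typing import Iterator, List, Tuple

WINDOW = 16  # words examined on each side, as stated in the abstract

# A candidate is a list of (word_index, letters_taken) pairs, in order.
# word_index is relative to the window, not to the whole document.
Candidate = List[Tuple[int, int]]


def _match(acronym: str, words: List[str], start: int) -> Iterator[Candidate]:
    """Yield every way to spell `acronym` from leading letters of words[start:].

    Assumption: words are used in order and may be skipped.
    """
    if not acronym:
        yield []
        return
    for i in range(start, len(words)):
        word = words[i].lower()
        # Take 1..k leading letters of this word while they match the acronym.
        for k in range(1, min(len(word), len(acronym)) + 1):
            if word[:k] != acronym[:k]:
                break
            for rest in _match(acronym[k:], words, i + 1):
                yield [(i, k)] + rest


def candidates(tokens: List[str], pos: int) -> List[Tuple[str, Candidate]]:
    """Enumerate every construction of tokens[pos] from its neighbours."""
    acronym = tokens[pos].lower()
    before = tokens[max(0, pos - WINDOW):pos]
    after = tokens[pos + 1:pos + 1 + WINDOW]
    found = []
    for direction, words in (("before", before), ("after", after)):
        for cand in _match(acronym, words, 0):
            found.append((direction, cand))
    return found


def is_acronym(attribute_bits: float, text_bits: float, threshold: float) -> bool:
    """Compare the ratio of the two entropies to a threshold.

    Assumption: a cheap attribute encoding relative to the plain-text
    encoding (a low ratio) is taken as evidence of an acronym.
    """
    return text_bits > 0 and attribute_bits / text_bits < threshold


tokens = "the Data Compression Conference ( DCC ) was held".split()
for direction, cand in candidates(tokens, 5):
    print(direction, cand)
```

Each candidate returned here carries the information behind the four attributes (direction, first-word offset, subsequent-word offsets, and letters taken per word); in the method described, every such candidate would be scored by the trained models before the ratio test is applied.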