Unsupervised learning of the morpho-semantic relationship in MEDLINE®

Authors:
W. John Wilbur
Affiliations:
National Institutes of Health, Bethesda, MD
Venue:
BioNLP '07 Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing
Year:
2007



Abstract

Morphological analysis as applied to English has generally involved the study of rules for inflections and derivations. Recent work has attempted to derive such rules from automatic analysis of corpora. Here we study similar issues, but in the context of the biological literature. We introduce a new approach that assigns to each pair of tokens occurring in text a probability of semantic relatedness based on the pair's relatedness as character strings. Our analysis is based on the more than 84 million sentences that compose the MEDLINE database and the more than 2.3 million token types that occur in MEDLINE. It enables us to identify over 36 million token type pairs whose assigned probability of semantic relatedness, based on their similarity as strings, is at least 0.7.
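The abstract does not say how the probabilities are computed, so the following is only a toy sketch of the general idea of keeping token type pairs whose string-based score reaches a cutoff. The score used here, the ratio from Python's `difflib.SequenceMatcher`, is a stand-in chosen for illustration and is not a probability of semantic relatedness. The function names `string_similarity` and `related_pairs` and the sample tokens are hypothetical. Only the 0.7 cutoff comes from the abstract.

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; NOT the paper's probability model."""
    return SequenceMatcher(None, a, b).ratio()


def related_pairs(token_types, threshold=0.7):
    """Return (a, b, score) for pairs whose stand-in score meets the threshold."""
    pairs = []
    # Compare every unordered pair of distinct token types (toy-scale only).
    for a, b in combinations(sorted(set(token_types)), 2):
        score = string_similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Hypothetical sample vocabulary, not taken from MEDLINE.
tokens = ["phosphorylate", "phosphorylation", "kinase", "kinases", "gene"]
for a, b, score in related_pairs(tokens):
    print(f"{a}\t{b}\t{score:.2f}")
```

Comparing all pairs is quadratic in the vocabulary size, so this loop would not scale to the 2.3 million token types reported above. A realistic pipeline would need some way to generate candidate pairs first, which this sketch does not attempt.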