Extracting semantic clusters from the alignment of definitions

Authors: Gerardo Sierra; John McNaught
Affiliations: Instituto de Ingeniería, UNAM, México; UMIST, Manchester, UK
Venue: COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Year: 2000

Abstract

Through the alignment of definitions from two or more different sources, it is possible to retrieve pairs of words that can be used interchangeably in the same sentence without changing the meaning of the concept. Because lexicographic work relies on common defining schemes, such as genus and differentia, different dictionaries define a concept in similar ways. The differences in the words used by two lexicographic sources let us extend the lexical knowledge base: clustering becomes available by merging two or more dictionaries into a single database and then applying an appropriate alignment technique. Since alignment starts from the same entry in two dictionaries, clustering is faster than any other technique.

The algorithm introduced here is analogy-based. It starts by calculating the Levenshtein distance, a variation of the edit distance, which allows us to align the definitions. As a measure of similarity, the concept of the longest collocation couple is introduced, which is the basis for clustering similar words. The process iterates, replacing similar pairs of words in the definitions, until no new clusters are found.
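The alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only computes a word-level Levenshtein distance between two definitions and backtraces the table to list the word pairs that were substituted for each other. The function name `align_definitions` and the two sample definitions are hypothetical, and the longest collocation couple measure and the iterative replacement step are not modelled, since the abstract does not specify them.

```python
def align_definitions(def_a, def_b):
    """Align two definitions word by word with Levenshtein distance.

    Returns (distance, pairs), where pairs lists the (word_a, word_b)
    substitutions found along one optimal alignment. Illustrative only.
    """
    a = def_a.lower().split()
    b = def_b.lower().split()
    n, m = len(a), len(b)

    # dist[i][j] = edit distance between the first i words of a
    # and the first j words of b (unit cost for every operation).
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a word from a
                dist[i][j - 1] + 1,         # insert a word from b
                dist[i - 1][j - 1] + cost,  # match or substitute
            )

    # Backtrace from the bottom-right corner, collecting substitutions.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if a[i - 1] == b[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost:
                pairs.append((a[i - 1], b[j - 1]))
            i -= 1
            j -= 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


# Hypothetical definitions of the same entry from two dictionaries.
distance, pairs = align_definitions(
    "an instrument for measuring atmospheric pressure",
    "a device for measuring atmospheric pressure",
)
print(distance, pairs)
# 2 [('an', 'a'), ('instrument', 'device')]
```

In this sketch every substituted pair is reported, including function words such as "an" and "a"; a similarity measure like the one the abstract names would be needed to decide which pairs actually form a cluster.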