Automatic extraction of the multiple semantic and syntactic categories of words

  • Authors:
  • David Portnoy;Peter Bock

  • Affiliations:
  • The George Washington University, Washington, DC;The George Washington University, Washington, DC

  • Venue:
  • AIAP'07: Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications
  • Year:
  • 2007


Abstract

A single unsupervised algorithm called lexical context deconvolution (LCD) is proposed to discover the semantic categories (senses) of polysemous words (those with more than one meaning) and the syntactic categories (parts of speech) of ambiguous words (those with more than one part of speech), relying solely on training with raw and unannotated text. No dictionaries, part-of-speech lexicons, stop-word lists, etc., are required or used. The knowledge about semantic and syntactic categories is acquired by collecting statistics from the lexical contexts in which words are found. The method first finds compact clusters of semantically similar words, which are assumed to represent the different semantic categories present in training texts. Discovering a given polysemous word's semantic categories is then treated as a problem of deconvolution. A target word's coöccurrence feature vector is assumed to be a linear combination of the categories' coöccurrence feature vectors. Thus a word's semantic categories are discovered by finding the non-negative least-squares solution to the system of linear equations formed by the word's and the categories' coöccurrence feature vectors. Finding syntactic categories of a word is accomplished by changing the word's feature vectors from coöccurrences to n-grams; every other part of the algorithm remains unchanged.
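The deconvolution step described above can be sketched in a few lines. The following is a toy illustration, not the paper's implementation: the category coöccurrence vectors and the mixing weights are invented numbers in a tiny 4-dimensional feature space, and the non-negative least-squares solve uses SciPy's `nnls` routine.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical coöccurrence feature vectors for two semantic
# categories (columns of A), e.g. a "finance" sense and a
# "geography" sense. These values are illustrative only.
A = np.array([
    [0.9, 0.1],
    [0.8, 0.0],
    [0.1, 0.7],
    [0.0, 0.9],
])

# Coöccurrence vector of a polysemous target word, assumed to be
# a non-negative linear combination of the category vectors.
b = 0.6 * A[:, 0] + 0.4 * A[:, 1]

# Non-negative least squares recovers the mixing weights, i.e.
# the strength of each semantic category for the target word.
weights, residual = nnls(A, b)
print(weights)  # close to [0.6, 0.4]
```

Since the target vector here lies exactly in the non-negative cone of the category vectors, the recovered weights match the mixture used to build it; for real coöccurrence statistics the residual would be nonzero and the weights only approximate. Per the abstract, switching the feature vectors from coöccurrences to n-grams yields syntactic rather than semantic categories with no other change to the procedure.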