A single unsupervised algorithm, lexical context deconvolution (LCD), is proposed to discover both the semantic categories (senses) of polysemous words (those with more than one meaning) and the syntactic categories (parts of speech) of ambiguous words (those with more than one part of speech), trained solely on raw, unannotated text. No dictionaries, part-of-speech lexicons, stop-word lists, or other external resources are required or used. Knowledge about semantic and syntactic categories is acquired by collecting statistics from the lexical contexts in which words occur. The method first finds compact clusters of semantically similar words, which are assumed to represent the distinct semantic categories present in the training texts. Discovering a given polysemous word's semantic categories is then treated as a deconvolution problem: the target word's coöccurrence feature vector is assumed to be a linear combination of the categories' coöccurrence feature vectors, so a word's semantic categories are recovered by finding the non-negative least-squares solution to the system of linear equations formed by the word's and the categories' coöccurrence feature vectors. Finding a word's syntactic categories requires only changing the feature vectors from coöccurrences to n-grams; every other part of the algorithm remains unchanged.
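The deconvolution step described above can be sketched as follows. The toy data (two hypothetical sense categories and their feature vectors) and the simple projected-gradient solver are illustrative assumptions, not the paper's implementation; a production system would use a proper NNLS routine (e.g. Lawson–Hanson) and category vectors obtained from the clustering stage.

```python
import numpy as np

def nnls_projected_gradient(A, b, lr=0.01, iters=5000):
    """Non-negative least squares via projected gradient descent.

    Minimizes ||A x - b||^2 subject to x >= 0 by gradient steps
    followed by clipping at zero. A minimal stand-in for a proper
    NNLS solver, adequate for this small example.
    """
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)          # gradient of the squared error
        x = np.maximum(0.0, x - lr * grad)  # project onto x >= 0
    return x

# Hypothetical toy data: coöccurrence feature vectors for two sense
# categories (assumed to come from the prior clustering stage).
finance = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])  # e.g. money, loan, interest
river   = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # e.g. water, shore, fish

# Columns of A are the category feature vectors.
A = np.column_stack([finance, river])

# A polysemous word's coöccurrence vector, a mixture of both senses.
bank = 0.6 * finance + 0.4 * river

weights = nnls_projected_gradient(A, bank)
print(weights.round(2))  # per-category mixing weights, here [0.6 0.4]
```

Nonzero entries of the solution vector identify which sense categories the word participates in, and their magnitudes indicate how strongly; the same machinery applies to syntactic categories once the feature vectors are swapped for n-gram counts.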