Discovering a term taxonomy from term similarities using principal component analysis

Authors:
Holger Bast;Georges Dupret;Debapriyo Majumdar;Benjamin Piwowarski
Affiliations:
Max-Planck-Institut für Informatik, Saarbrücken;Yahoo! Research Latin America;Max-Planck-Institut für Informatik, Saarbrücken;Yahoo! Research Latin America
Venue:
EWMF'05/KDO'05 Proceedings of the 2005 joint international conference on Semantics, Web and Mining
Year:
2005

Abstract

We show that eigenvector decomposition can be used to extract a term taxonomy from a given collection of text documents. Until now, methods based on eigenvector decomposition, such as latent semantic indexing (LSI) or principal component analysis (PCA), have only been known to be useful for extracting symmetric relations between terms. We give a precise mathematical criterion for distinguishing between four kinds of relations between a given pair of terms in a collection: unrelated (car – fruit), symmetrically related (car – automobile), asymmetrically related with the first term being more specific than the second (banana – fruit), and asymmetrically related in the other direction (fruit – banana). We give theoretical evidence for the soundness of our criterion by showing that, in a simplified mathematical model, the criterion produces the intuitively correct result. We applied our scheme to the reconstruction of a selected part of the Open Directory Project (ODP) hierarchy, with promising results.
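
To make the starting point concrete, the following is a minimal sketch of the conventional, symmetric use of eigenvector decomposition that the abstract contrasts itself with. It is not the paper's criterion, which the abstract does not spell out. The toy term-document matrix, the function name `rank_k_similarity`, and the choice of cosine similarity in the top-k eigenvector space are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["car", "automobile", "banana", "fruit"]
A = np.array([
    [1, 1, 1, 1, 0, 0, 0],  # car
    [1, 1, 0, 0, 0, 0, 0],  # automobile
    [0, 0, 0, 0, 1, 0, 1],  # banana
    [0, 0, 0, 0, 1, 1, 1],  # fruit
], dtype=float)


def rank_k_similarity(A, k):
    """Cosine similarity between terms in the top-k eigenvector space.

    Illustrative helper (an assumption, not taken from the paper).
    """
    gram = A @ A.T                            # term-term co-occurrence matrix
    eigvals, eigvecs = np.linalg.eigh(gram)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    # Scale each retained eigenvector by the square root of its eigenvalue.
    emb = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


S = rank_k_similarity(A, k=2)
idx = {t: n for n, t in enumerate(terms)}
print("car / automobile:", round(S[idx["car"], idx["automobile"]], 3))
print("car / fruit:     ", round(S[idx["car"], idx["fruit"]], 3))
print("banana / fruit:  ", round(S[idx["banana"], idx["fruit"]], 3))
print("fruit / banana:  ", round(S[idx["fruit"], idx["banana"]], 3))
```

In this sketch the similarity matrix is symmetric by construction, so the banana/fruit and fruit/banana entries are identical. A single similarity matrix of this kind can separate related from unrelated pairs but cannot indicate which term is the more specific one, which is the gap the abstract's criterion addresses.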