Learning a taxonomy from a set of text documents

  • Authors:
  • Mari-Sanna Paukkeri;Alberto Pérez García-Plaza;Víctor Fresno;Raquel Martínez Unanue;Timo Honkela

  • Affiliations:
  • Aalto University School of Science, Adaptive Informatics Research Centre, P.O. Box 15400, FI-00076 Aalto, Finland;NLP & IR Group, E.T.S.I. Informática, UNED, 28040 Madrid, Spain;NLP & IR Group, E.T.S.I. Informática, UNED, 28040 Madrid, Spain;NLP & IR Group, E.T.S.I. Informática, UNED, 28040 Madrid, Spain;Aalto University School of Science, Adaptive Informatics Research Centre, P.O. Box 15400, FI-00076 Aalto, Finland

  • Venue:
  • Applied Soft Computing
  • Year:
  • 2012

Abstract

We present a methodology for learning a taxonomy from a set of text documents, each of which describes one concept. The taxonomy is obtained by clustering the concept definition documents with a hierarchical approach to the Self-Organizing Map. In this study, we compare three feature extraction approaches with varying degrees of language independence. The feature extraction schemes are fuzzy logic-based feature weighting and selection, statistical keyphrase extraction, and the traditional tf-idf weighting scheme. The experiments are conducted for English, Finnish, and Spanish. The results show that while the rule-based fuzzy logic systems have an advantage in automatic taxonomy learning, taxonomies can also be constructed with tolerable results using statistical methods without domain- or style-specific knowledge.
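
As an illustration of the tf-idf weighting scheme named in the abstract, the following is a minimal sketch and not the authors' implementation. The function name `tf_idf`, the toy documents, the whitespace tokenization, and the specific variant used (length-normalized term frequency times the natural log of inverse document frequency) are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists, one per concept-definition document.
    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: each document defines one concept.
docs = [
    "a cat is a small domesticated feline".split(),
    "a dog is a domesticated canine".split(),
    "a feline is a carnivorous mammal".split(),
]
vectors = tf_idf(docs)
```

Under this variant, a term that occurs in every document (here "a") receives weight zero, while a term confined to one document (here "cat") is weighted higher than one shared by two documents (here "feline"). Vectors of this kind are the sort of input a clustering method could then group into a hierarchy.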