Taxonomy generation for text segments: A practical web-based approach

  • Authors:
  • Shui-Lung Chuang; Lee-Feng Chien

  • Affiliations:
  • Institute of Information Science, Academia Sinica, Taipei, Taiwan; Institute of Information Science, Academia Sinica and Department of Information Management, National Taiwan University

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2005

Abstract

Many information systems need to organize short text segments, such as keywords in documents and queries from users, into a well-formed taxonomy. In this article, we address the problem of taxonomy generation for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information for reliable feature extraction. This work investigates the use of highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then designed to create the hierarchical topic structure of the text segments: segments with closely related concepts are grouped together in a cluster, and related clusters are linked at the same or nearby levels. Unlike traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the algorithm tries to produce a more natural and comprehensive tree hierarchy. Extensive experiments were conducted on text segments from different domains, including subject terms, people's names, paper titles, and natural language questions. The experimental results show the potential of the proposed approach, which provides a basis for the in-depth analysis of text segments on a larger scale and is expected to benefit many information systems.
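
The two-stage idea in the abstract (enrich each short segment with search-result snippets, then cluster the enriched representations hierarchically) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it substitutes plain average-linkage agglomerative clustering, which yields a binary tree rather than the more natural hierarchy the paper aims for. The `SNIPPETS` dictionary is a hypothetical stand-in for snippets returned by a search engine, and the names `featurize`, `cosine`, `build_taxonomy`, and `leaves` are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical stand-in for highly ranked search-result snippets.
# A real system would query a search engine for each text segment.
SNIPPETS = {
    "jaguar speed": ["the jaguar is a big cat that runs fast",
                     "jaguar cat speed in the wild"],
    "leopard habitat": ["the leopard is a big cat living in forest habitat",
                        "leopard cat wild habitat"],
    "python tutorial": ["learn python programming language tutorial",
                        "python code programming examples"],
    "java compiler": ["java programming language compiler",
                      "compile java code programming"],
}


def featurize(segment, snippets):
    """Represent a segment by term counts over its snippets (assumed weighting)."""
    words = " ".join(snippets.get(segment, [])).lower().split()
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_taxonomy(segments, snippets):
    """Average-linkage agglomerative clustering; returns a nested-tuple tree."""
    vectors = {s: featurize(s, snippets) for s in segments}
    # Each cluster is a pair: (subtree, list of member segments).
    clusters = [(s, [s]) for s in segments]

    def avg_sim(m1, m2):
        total = sum(cosine(vectors[x], vectors[y]) for x in m1 for y in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        # Merge the two most similar clusters.
        i, j = max(pairs, key=lambda p: avg_sim(clusters[p[0]][1],
                                                clusters[p[1]][1]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def leaves(tree):
    """Flatten a nested-tuple tree into its list of segments."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


if __name__ == "__main__":
    print(build_taxonomy(list(SNIPPETS), SNIPPETS))
```

With the toy snippets above, the two animal-related segments share vocabulary only with each other, as do the two programming-related segments, so each pair ends up in its own subtree below the root.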