Building thematic lexical resources by term categorization

Authors:
Alberto Lavelli;Bernardo Magnini;Fabrizio Sebastiani
Affiliations:
ITC-irst, Trento, Italy;ITC-irst, Trento, Italy;Istituto di Elaborazione dell'Informazione, Pisa, Italy
Venue:
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2002

Citing 4
Cited 0

A Parallel Computing Approach to Creating Engineering Concept Spaces for Semantic Retrieval: The Illinois Digital Library Initiative Project

IEEE Transactions on Pattern Analysis and Machine Intelligence
BoosTexter: A Boosting-based Systemfor Text Categorization

Machine Learning - Special issue on information retrieval
An improved boosting algorithm and its application to text categorization

Proceedings of the ninth international conference on Information and knowledge management
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)

Quantified Score

Hi-index	0.00

Visualization

Abstract

We discuss the automatic generation of thematic lexicons by means of term categorization, a novel task employing techniques from information retrieval (IR) and machine learning (ML). Specifically, we view the generation of such lexicons as an iterative process of learning previously unknown associations between terms and themes (i.e. disciplines, or fields of activity). The process is iterative, in that it generates, for each ci in a set C = {c1,...,cm} of themes, a sequence Li0⊆ Li1⊆ ... ⊆ Lin of lexicons, bootstrapping from an initial lexicon Li0 and a set of text corpora &THgr; = {&thgr;0,...,&thgr;n-1} given as input. The method is inspired by text categorization, the discipline concerned with labelling natural language texts with labels from a predefined set of themes, or categories. However, while text categorization deals with documents represented as vectors in a space of terms, term categorization deals (dually) with terms represented as vectors in a space of documents, and labels terms (instead of documents) with themes. As a learning device we adopt boosting, since (a) it has demonstrated state-of-the-art effectiveness in a variety of text categorization applications, and (b) it naturally allows for a form of "data cleaning", thereby making the process of generating a thematic lexicon an iteration of generate-and-test steps.