Exploiting temporal contexts in text classification

  • Authors:
  • Leonardo Rocha;Fernando Mourão;Adriano Pereira;Marcos André Gonçalves;Wagner Meira, Jr.

  • Affiliations:
  • Federal University of Minas Gerais, Belo Horizonte, Brazil;Federal University of Minas Gerais, Belo Horizonte, Brazil;Federal University of Minas Gerais, Belo Horizonte, Brazil;Federal University of Minas Gerais, Belo Horizonte, Brazil;Federal University of Minas Gerais, Belo Horizonte, Brazil

  • Venue:
  • Proceedings of the 17th ACM conference on Information and knowledge management
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Due to the increasing amount of information being stored and accessible through the Web, Automatic Document Classification (ADC) has become an important research topic. ADC usually employs a supervised learning strategy, where we first build a classification model using pre-classified documents and then use it to classify unseen documents. One major challenge in building classifiers is dealing with the temporal evolution of the characteristics of the documents and the classes to which they belong. However, most of the current techniques for ADC do not consider this evolution while building and using the models. Previous results show that the performance of classifiers may be affected by three different temporal effects (class distribution, term distribution and class similarity). Further, it is shown that using just portions of the pre-classified documents, which we call contexts, for building the classifiers, result in better performance, as a consequence of the minimization of the aforementioned effects. In this paper we define the concept of temporal contexts as being the portions of documents that minimize those effects. We then propose a general algorithm for determining such contexts, discuss its implementation-related issues, and propose a heuristic that is able to determine temporal contexts efficiently. In order to demonstrate the effectiveness of our strategy, we evaluated it using two distinct collections: ACM-DL and MedLine. We initially evaluated the reduction in terms of both the effort to build a classifier and the entropy associated with each context. Further, we evaluated whether these observed reductions translate into better classification performance by employing a very simple classifier, majority voting. The results show that we achieved precision gains of up to 30% compared to a version that is not temporally contextualized, and the same accuracy of a state-of-the-art classifier (SVM), while presenting an execution time up to hundreds of times faster.