Temporal contexts: Effective text classification in evolving document collections

  • Authors:
  • Leonardo Rocha;Fernando Mourão;Hilton Mota;Thiago Salles;Marcos André Gonçalves;Wagner Meira Jr.

  • Affiliations:
  • Federal University of São João del-Rei, Computer Science Department-São João del-Rei, Brazil;Federal University of Minas Gerais, Computer Science Department-Belo Horizonte, Brazil;Federal University of Minas Gerais, Electrical Engineering Department-Belo Horizonte, Brazil;Federal University of Minas Gerais, Computer Science Department-Belo Horizonte, Brazil;Federal University of Minas Gerais, Computer Science Department-Belo Horizonte, Brazil;Federal University of Minas Gerais, Computer Science Department-Belo Horizonte, Brazil

  • Venue:
  • Information Systems
  • Year:
  • 2013

Abstract

Managing the huge and growing amount of information available today makes Automatic Document Classification (ADC) both crucial and very challenging. The dynamics inherent to classification problems, mainly on the Web, make the task even harder. Nevertheless, the actual impact of such temporal evolution on ADC is still poorly understood in the literature. In this context, this work aims to evaluate, characterize, and exploit temporal evolution to improve ADC techniques. As a first contribution, we propose a pragmatic methodology for evaluating temporal evolution in ADC domains. Through this methodology, we can identify measurable factors associated with the degradation of ADC models over time. Going a step further, based on these analyses, we propose effective and efficient strategies to make current techniques more robust to natural shifts over time. We present a strategy, named temporal context selection, for selecting portions of the training set that minimize those factors. Our second contribution is a general algorithm, called Chronos, for determining such contexts. By instantiating Chronos, we are able to reduce uncertainty and improve the overall classification accuracy. Empirical evaluations of heuristic instantiations of the algorithm, named WindowsChronos and FilterChronos, on two real document collections demonstrate the usefulness of our proposal. Comparing them against state-of-the-art ADC algorithms shows that selecting temporal contexts improves classification accuracy by up to 10%. Finally, we highlight the applicability and generality of our proposal in practice, pointing to this study as a promising research direction.
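
The idea of selecting a temporal context can be illustrated with a minimal sketch. It is not the Chronos algorithm or either of its instantiations, whose details the abstract does not give. It assumes a fixed symmetric time window around the test document's creation year and a toy term-overlap classifier. The names `Doc`, `select_temporal_context`, `classify`, and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    year: int     # creation time of the document
    label: str    # class label (known for training documents)
    terms: tuple  # terms occurring in the document


def select_temporal_context(training, target_year, window):
    """Keep only training documents within `window` years of `target_year`.

    Assumption: a fixed symmetric window; the paper's own selection
    criteria are not described in the abstract.
    """
    return [d for d in training if abs(d.year - target_year) <= window]


def classify(training, terms, target_year, window):
    """Toy classifier: score each class by term overlap, using only the
    training documents in the selected temporal context."""
    context = select_temporal_context(training, target_year, window)
    scores = Counter()
    for d in context:
        scores[d.label] += len(set(d.terms) & set(terms))
    if not scores:
        return None  # no training document falls inside the window
    label, best = scores.most_common(1)[0]
    return label if best > 0 else None


# Hypothetical collection in which the term "apple" drifts between classes.
TRAINING = [
    Doc(1980, "agriculture", ("apple", "harvest")),
    Doc(1981, "agriculture", ("apple", "orchard")),
    Doc(1982, "agriculture", ("apple", "farm")),
    Doc(2010, "technology", ("apple", "phone")),
    Doc(2011, "technology", ("apple", "tablet")),
]

narrow = classify(TRAINING, ("apple",), target_year=2010, window=1)
wide = classify(TRAINING, ("apple",), target_year=2010, window=100)
print(narrow, wide)
```

In this toy collection, the narrow window uses only the two recent documents, while the wide window lets the older and more numerous documents dominate the vote. This is the kind of degradation over time that the abstract attributes to temporal evolution.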