Temporally-aware algorithms for document classification

  • Authors:
  • Thiago Salles;Leonardo Rocha;Gisele L. Pappa;Fernando Mourão;Wagner Meira, Jr.;Marcos Gonçalves

  • Affiliations:
  • Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil;Fed. Univ. São João Del Rei, São João Del Rei, Brazil;Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil;Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil;Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil;Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil

  • Venue:
  • Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2010

Abstract

Automatic Document Classification (ADC) remains one of the major information retrieval problems. It usually employs a supervised learning strategy: a classification model is first built from pre-classified documents and then used to classify unseen documents. Most supervised algorithms assume that all documents provide equally important information. In practice, however, a document may be more or less important for building the classification model depending on several factors, such as its timeliness, the venue in which it was published, and its authors. In this paper, we are particularly concerned with the impact that temporal effects may have on ADC and with how to minimize that impact. To deal with these effects, we introduce a temporal weighting function (TWF) and propose a methodology to determine it for document collections. We applied the proposed methodology to ACM-DL and Medline and found that the TWF of both follows a lognormal distribution. We then extended three ADC algorithms (namely kNN, Rocchio and Naïve Bayes) to incorporate the TWF. Experiments showed that the temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms.
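
To make the idea of a temporally-aware classifier concrete, the following is a minimal sketch of a kNN variant in which each training document's similarity score is scaled by a lognormal-shaped weight of its temporal distance to the test document. It is not the authors' implementation. The function names, the parameters `mu`, `sigma` and `shift`, the use of the absolute distance, and the multiplication of similarity by weight are all assumptions chosen for illustration; the abstract only states that the TWF follows a lognormal and that kNN was extended to incorporate it.

```python
import math
from collections import defaultdict


def lognormal_twf(distance, mu=0.0, sigma=1.0, shift=1.0):
    """Hypothetical lognormal-shaped weight for a temporal distance in years.

    Assumption: the distance is made symmetric with abs() and shifted so the
    argument of the lognormal density is strictly positive.
    """
    x = abs(distance) + shift
    exponent = -((math.log(x) - mu) ** 2) / (2.0 * sigma ** 2)
    return math.exp(exponent) / (x * sigma * math.sqrt(2.0 * math.pi))


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def temporal_knn(test_vec, test_year, training, k=3, twf=lognormal_twf):
    """Classify by summing temporally weighted similarities of the k best neighbours.

    `training` is a list of (term_vector, year, label) tuples.
    """
    scored = [
        (cosine(test_vec, vec) * twf(year - test_year), label)
        for vec, year, label in training
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for score, label in scored[:k]:
        votes[label] += score
    return max(votes, key=votes.get)


# Toy data (invented for illustration): an old, textually identical document
# and a recent, slightly less similar one.
training = [
    ({"neural": 1.0, "network": 1.0}, 1990, "hardware"),
    ({"neural": 1.0, "network": 1.0, "learning": 1.0}, 2009, "ml"),
]
query = {"neural": 1.0, "network": 1.0}

with_time = temporal_knn(query, 2010, training, k=1)
without_time = temporal_knn(query, 2010, training, k=1, twf=lambda d: 1.0)
```

In this toy setting the uniform weight lets the older, textually identical document decide the class, while the temporal weight favours the more recent neighbour. A real TWF would be fitted to the collection, as the abstract describes, rather than fixed by hand.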