A comparative study of citations and links in document classification

  • Authors:
  • Thierson Couto;Marco Cristo;Marcos André Gonçalves;Pável Calado;Nivio Ziviani;Edleno Moura;Berthier Ribeiro-Neto

  • Affiliations:
  • University of Minas Gerais, Belo Horizonte, Brazil;University of Minas Gerais, Belo Horizonte, Brazil;University of Minas Gerais, Belo Horizonte, Brazil;IST/INESC-ID, Lisboa, Portugal;University of Minas Gerais, Belo Horizonte, Brazil;Federal University of Amazonas, Manaus, Brazil;Federal University of Minas Gerais, Belo Horizonte, Brazil and Google Engineering, Belo Horizonte, Brazil

  • Venue:
  • Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
  • Year:
  • 2006

Abstract

It is well known that links are an important source of information when dealing with Web collections. However, the question remains whether the techniques used on the Web can also be applied to collections of documents containing citations between scientific papers. In this work we present a comparative study of digital library citations and Web links in the context of automatic text classification. We show that there are in fact differences between citations and links in this context. For the comparison, we run a series of experiments using a digital library of computer science papers and a Web directory. In our reference collections, measures based on co-citation tend to perform better for pages in the Web directory, with gains of up to 37% over text-based classifiers, while measures based on bibliographic coupling perform better in the digital library. We also propose a simple and effective way of combining a traditional text-based classifier with a citation-link-based classifier. This combination is based on the notion of classifier reliability and yielded gains of up to 14% in micro-averaged F1 in the Web collection. However, no significant gain was obtained in the digital library. Finally, a user study was performed to further investigate the causes of these results. We discovered that the documents misclassified by the citation-link-based classifiers are in fact difficult cases, hard to classify even for humans.
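
For readers unfamiliar with the two link-based measures named in the abstract, the sketch below shows their textbook definitions as raw counts over a toy citation graph. The graph, the function names, and the use of unnormalized counts are all assumptions made for illustration; the abstract does not say how the authors normalize or weight these measures, so this should not be read as their implementation.

```python
from typing import Dict, Set

# Hypothetical toy graph: each key cites (or links to) the documents in its set.
CITES: Dict[str, Set[str]] = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"e"},
    "d": set(),
    "e": set(),
}


def bibliographic_coupling(graph: Dict[str, Set[str]], x: str, y: str) -> int:
    """Count the documents that both x and y cite (shared outgoing references)."""
    return len(graph.get(x, set()) & graph.get(y, set()))


def co_citation(graph: Dict[str, Set[str]], x: str, y: str) -> int:
    """Count the documents that cite both x and y (shared incoming references)."""
    return sum(1 for cited in graph.values() if x in cited and y in cited)


if __name__ == "__main__":
    # "a" and "b" share the references "c" and "d".
    print(bibliographic_coupling(CITES, "a", "b"))
    # "c" and "d" are cited together by "a" and by "b".
    print(co_citation(CITES, "c", "d"))
```

Bibliographic coupling depends only on a document's own reference list, which is fixed at publication time, whereas co-citation depends on incoming citations or links that accumulate later. This difference is one plausible reason the two measures could behave differently on a digital library and on a Web directory, although the abstract itself does not state the cause.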