A comparative study of citations and links in document classification

Authors:
Thierson Couto;Marco Cristo;Marcos André Gonçalves;Pável Calado;Nivio Ziviani;Edleno Moura;Berthier Ribeiro-Neto
Affiliations:
University of Minas Gerais, Belo Horizonte, Brazil;University of Minas Gerais, Belo Horizonte, Brazil;University of Minas Gerais, Belo Horizonte, Brazil;IST/INESC-ID, Lisboa, Portugal;University of Minas Gerais, Belo Horizonte, Brazil;Federal University of Amazonas, Manaus, Brazil;Federal University of Minas Gerais, Belo Horizonte, Brazil and Google Engineering, Belo Horizonte, Brazil
Venue:
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Year:
2006

Abstract

It is well known that links are an important source of information when dealing with Web collections. However, it remains unclear whether the techniques used on the Web can be applied to collections of documents containing citations between scientific papers. In this work we present a comparative study of digital library citations and Web links in the context of automatic text classification, and show that citations and links do in fact behave differently in this context. For the comparison, we run a series of experiments using a digital library of computer science papers and a Web directory. In our reference collections, measures based on co-citation tend to perform better for pages in the Web directory, with gains of up to 37% over text-based classifiers, while measures based on bibliographic coupling perform better in the digital library. We also propose a simple and effective way of combining a traditional text-based classifier with a citation-link-based classifier. This combination relies on the notion of classifier reliability and yielded gains of up to 14% in micro-averaged F1 in the Web collection; no significant gain was obtained in the digital library. Finally, a user study was performed to further investigate the causes of these results. We found that the documents misclassified by the citation-link-based classifiers are in fact difficult cases, hard to classify even for humans.
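
To make the terms in the abstract concrete, the following is a minimal Python sketch of the two link-based similarity measures it names (co-citation and bibliographic coupling) and of micro-averaged F1. The abstract does not say how the measures are normalized, so the Jaccard-style normalization used here is an assumption, as are all function names and the toy data. It is an illustration of the standard definitions, not the authors' implementation.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Build adjacency sets from (citing, cited) pairs.

    Returns (cites, cited_by): cites[d] is the set of documents d points to,
    cited_by[d] is the set of documents that point to d.
    """
    cites = defaultdict(set)
    cited_by = defaultdict(set)
    for src, dst in edges:
        cites[src].add(dst)
        cited_by[dst].add(src)
    return cites, cited_by


def jaccard(a, b):
    """Overlap of two sets divided by the size of their union (assumed normalization)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def co_citation(cited_by, d1, d2):
    """Two documents are similar if the same documents cite (link to) both."""
    return jaccard(cited_by.get(d1, set()), cited_by.get(d2, set()))


def bibliographic_coupling(cites, d1, d2):
    """Two documents are similar if they cite (link to) the same documents."""
    return jaccard(cites.get(d1, set()), cites.get(d2, set()))


def micro_f1(gold, predicted):
    """Micro-averaged F1: pool TP, FP and FN over all documents and classes."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but wrong
        fn += len(g - p)   # correct labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy citation graph (hypothetical data): p1..p3 cite a and/or b,
# while a and b each cite two further documents.
edges = [
    ("p1", "a"), ("p1", "b"),
    ("p2", "a"), ("p2", "b"),
    ("p3", "a"),
    ("a", "x"), ("a", "y"),
    ("b", "y"), ("b", "z"),
]
cites, cited_by = build_neighbors(edges)
cocit_ab = co_citation(cited_by, "a", "b")
coupl_ab = bibliographic_coupling(cites, "a", "b")

# Toy classification outcome (hypothetical labels), one label set per document.
gold = [{"A"}, {"B"}, {"A", "C"}]
pred = [{"A"}, {"C"}, {"A"}]
f1 = micro_f1(gold, pred)

print(cocit_ab, coupl_ab, f1)
```

In this toy graph, `a` and `b` share two of three citing documents and one of three cited documents. Co-citation depends only on incoming links, which any later document or page can add. Bibliographic coupling depends only on outgoing links, which are fixed by the document's own author.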