A comparison of on-line computer science citation databases

Authors:
Vaclav Petricek;Ingemar J. Cox;Hui Han;Isaac G. Councill;C. Lee Giles
Affiliations:
University College London, London, United Kingdom;University College London, London, United Kingdom;Yahoo! Inc., Sunnyvale, CA;The School of Information Sciences and Technology, The Pennsylvania State University, University Park, PA;The School of Information Sciences and Technology, The Pennsylvania State University, University Park, PA
Venue:
ECDL'05 Proceedings of the 9th European conference on Research and Advanced Technology for Digital Libraries
Year:
2005

Citing 3

Digital Libraries and Autonomous Citation Indexing
Computer

Shilling recommender systems for fun and profit
Proceedings of the 13th international conference on World Wide Web

REFEREE: an open framework for practical testing of recommender systems using ResearchIndex
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

Cited 7

Record matching in digital library metadata
Communications of the ACM - Alternate reality gaming

Towards a model of computer systems research
WOWCS'08 Proceedings of the conference on Organizing Workshops, Conferences, and Symposia for Computer Systems

Using search strategies and a description logic paradigm with conditional preferences for literature search
International Journal of Metadata, Semantics and Ontologies

Conference reviewing considered harmful
ACM SIGOPS Operating Systems Review

An analysis of the evolving coverage of computer science sub-fields in the DBLP digital library
ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries

Construction of a large-scale test set for author disambiguation
Information Processing and Management: an International Journal

Tuning large scale deduplication with reduced effort
Proceedings of the 25th International Conference on Scientific and Statistical Database Management

Abstract

This paper examines the differences and similarities between two on-line computer science citation databases, DBLP and CiteSeer. The database entries in DBLP are entered manually, whereas the CiteSeer entries are obtained autonomously via a crawl of the Web and automatic processing of user submissions. CiteSeer's autonomous citation database can be considered a form of self-selected on-line survey. It is important to understand the limitations of such databases, particularly when citation information is used to assess the performance of authors, institutions and funding bodies. We show that the CiteSeer database contains considerably fewer single-author papers. This bias can be modeled by an exponential process with an intuitive explanation. The model permits us to predict that the DBLP database covers approximately 24% of the entire literature of Computer Science. CiteSeer is also biased against low-cited papers. Despite their differences, both databases exhibit citation distributions that are similar to each other yet significantly different from those reported in previous analyses of the Physics community. In both databases, we also observe that the number of authors per paper has been increasing over time.