Learning metadata from the evidence in an on-line citation matching scheme

Authors:
Isaac G. Councill;Huajing Li;Ziming Zhuang;Sandip Debnath;Levent Bolelli;Wang Chien Lee;Anand Sivasubramaniam;C. Lee Giles
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Year:
2006

Abstract

Citation matching, the automatic grouping of bibliographic references that refer to the same document, is a data management problem faced by automatic digital libraries for scientific literature such as CiteSeer and Google Scholar. Although several solutions have been offered for citation matching in large bibliographic databases, they typically require expensive batch clustering operations that must be run offline. Large digital libraries containing citation information can reduce maintenance costs and provide new services through efficient online processing of citation data, resolving document citation relationships as new records become available. Additionally, information found in citations can be used to supplement document metadata. This requires generating a canonical citation record by merging variant citation subfields into a unified "best guess" from which to draw information. Citation information must also be merged with other information sources in order to provide a complete document record. This paper outlines a system and algorithms for online citation matching and canonical metadata generation. A Bayesian framework is employed to build the ideal citation record for a document; the framework carries the added advantages of fusing information from disparate sources and increasing system resilience to erroneous data.
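
The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the two ideas it names, not the authors' method. It assumes that incoming citations are matched online by Jaccard similarity over title tokens, and that a canonical field value is chosen as the most probable candidate under a simple Bayesian model with a per-source reliability. The names `OnlineMatcher`, `canonical_value`, the `threshold` parameter, and the reliability numbers are all hypothetical.

```python
import math
import re
from collections import defaultdict


def tokens(title):
    """Return the set of lowercase alphanumeric tokens in a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


class OnlineMatcher:
    """Assigns each citation to a cluster as it arrives, with no batch pass."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold     # hypothetical Jaccard cutoff
        self.clusters = []             # cluster id -> list of citation dicts
        self.index = defaultdict(set)  # title token -> ids of clusters using it

    def add(self, citation):
        toks = tokens(citation["title"])
        # Only clusters sharing at least one token are compared.
        candidates = set()
        for t in toks:
            candidates |= self.index.get(t, set())
        best, best_sim = None, 0.0
        for cid in sorted(candidates):
            # Assumption: the first citation in a cluster represents it.
            rep = tokens(self.clusters[cid][0]["title"])
            sim = len(toks & rep) / len(toks | rep)
            if sim > best_sim:
                best, best_sim = cid, sim
        if best is None or best_sim < self.threshold:
            best = len(self.clusters)  # no close match: start a new cluster
            self.clusters.append([])
        self.clusters[best].append(citation)
        for t in toks:
            self.index[t].add(best)
        return best


def canonical_value(observations):
    """Pick the most probable value for one field.

    observations: list of (value, reliability) pairs, 0 < reliability < 1.
    Assumed model: uniform prior over the observed candidates; a source
    reports the true value with probability `reliability` and otherwise
    reports one of the other candidates uniformly at random.
    Returns (best_value, posterior_probability).
    """
    values = sorted({v for v, _ in observations})
    if len(values) == 1:
        return values[0], 1.0
    log_post = {}
    for cand in values:
        lp = 0.0
        for v, r in observations:
            lp += math.log(r if v == cand else (1.0 - r) / (len(values) - 1))
        log_post[cand] = lp
    top = max(log_post.values())
    norm = sum(math.exp(lp - top) for lp in log_post.values())
    best = max(values, key=lambda c: log_post[c])
    return best, math.exp(log_post[best] - top) / norm
```

In this sketch, two title variants that differ only in case and punctuation fall into the same cluster on arrival, and a field value reported by a more reliable source can outweigh a conflicting variant. A production system would need stronger blocking and field-specific comparison than title tokens alone.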