Kairos: proactive harvesting of research paper metadata from scientific conference web sites

Authors:
Markus Hänse;Min-Yen Kan;Achim P. Karduck
Affiliations:
Hochschule Furtwangen University;Department of Computer Science, National University of Singapore;Hochschule Furtwangen University
Venue:
ICADL'10 Proceedings of the role of digital libraries in a time of global change, and 12th international conference on Asia-Pacific digital libraries
Year:
2010

Citing 8

Assessing agreement on classification tasks: the kappa statistic
Computational Linguistics

Web classification using support vector machine
Proceedings of the 4th international workshop on Web information and data management

Fast webpage classification using URL features
Proceedings of the 14th ACM international conference on Information and knowledge management

Information extraction from research papers using conditional random fields
Information Processing and Management: an International Journal

Guide focused crawler efficiently and effectively using on-line topical importance estimation
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Purely URL-based topic classification
Proceedings of the 18th international conference on World wide web

Interactive information extraction with constrained conditional random fields
AAAI'04 Proceedings of the 19th national conference on Artificial intelligence

Exploiting genre in focused crawling
SPIRE'07 Proceedings of the 14th international conference on String processing and information retrieval

Cited 2

Towards a comprehensive call ontology for Research 2.0
i-KNOW '11 Proceedings of the 11th International Conference on Knowledge Management and Knowledge Technologies

A Comprehensive Study of Techniques for URL-Based Web Page Language Classification
ACM Transactions on the Web (TWEB)

Abstract

We investigate the automatic harvesting of research paper metadata from recent scholarly events. Our system, Kairos, combines a focused crawler and an information extraction engine to convert a list of conference websites into an index of metadata fields that correspond to individual papers. Using event date metadata extracted from the conference website, Kairos proactively harvests metadata about the individual papers soon after they are made public. We use a Maximum Entropy classifier to classify uniform resource locators (URLs) as scientific conference websites, and Conditional Random Fields (CRFs) to extract individual paper metadata from such websites. Experiments show a classification accuracy of over 95% for each of the two components.
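
To make the URL classification step more concrete, the following is a minimal sketch of a binary maximum entropy model over URL tokens, written with the Python standard library only. In the binary case a maximum entropy classifier is equivalent to logistic regression, which is what the code fits. The tokenization, the training loop, the hyperparameters, and the toy URLs on the reserved `example.*` domains are all assumptions made for illustration; the abstract does not state which features or training procedure Kairos uses.

```python
import math
import re
from collections import defaultdict

BIAS = "<bias>"  # hypothetical name for the always-on bias feature


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (assumed feature set)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def predict_proba(weights, url):
    """Return the model's probability that the URL is a conference website."""
    feats = set(url_tokens(url)) | {BIAS}
    z = sum(weights.get(f, 0.0) for f in feats)
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(samples, epochs=300, lr=0.2):
    """Fit binary MaxEnt (logistic regression) weights by per-sample gradient ascent.

    samples: list of (url, label) pairs, label 1 = conference site, 0 = other.
    Returns a dict-like mapping from token to weight.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for url, label in samples:
            p = predict_proba(weights, url)
            # Gradient of the log-likelihood for binary features: (label - p).
            for f in set(url_tokens(url)) | {BIAS}:
                weights[f] += lr * (label - p)
    return weights


# Toy, made-up training data; not taken from the paper's experiments.
TRAIN = [
    ("http://example.org/conf/icadl2010/call-for-papers", 1),
    ("http://example.org/conf/sigir2008/program", 1),
    ("http://example.org/conf/www2009/accepted-papers", 1),
    ("http://example.com/store/cart/checkout", 0),
    ("http://example.com/store/sports/scores", 0),
    ("http://example.com/store/blog/recipes", 0),
]

if __name__ == "__main__":
    model = train_maxent(TRAIN)
    for url, label in TRAIN:
        print(label, round(predict_proba(model, url), 3), url)
```

A classifier of this kind needs only the URL string, so a focused crawler could score candidate links before fetching them. The CRF-based extraction of paper metadata is a separate sequence labeling step and is not sketched here.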