We investigate the automatic harvesting of research paper metadata from recent scholarly events. Our system, Kairos, combines a focused crawler and an information extraction engine to convert a list of conference websites into an index populated with metadata fields for individual papers. Using event dates extracted from each conference website, Kairos proactively harvests metadata about individual papers soon after they are made public. We use a Maximum Entropy classifier to classify uniform resource locators (URLs) as scientific conference websites, and Conditional Random Fields (CRFs) to extract individual paper metadata from those websites. Experiments show a classification accuracy of over 95% for each of the two components.
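
As a rough illustration of the URL classification step, the sketch below trains a binary Maximum Entropy model (equivalent to logistic regression in the two-class case) on tokens taken from URLs. It is not the Kairos implementation: the feature set, the function names (`url_features`, `train`, `predict_proba`), the hyperparameters, and the toy training URLs are all assumptions made for this example.

```python
import math
import re
from collections import defaultdict

def url_features(url):
    """Binary features from URL tokens (a hypothetical feature set)."""
    feats = {"<bias>": 1.0}
    for tok in re.findall(r"[a-z]+|\d+", url.lower()):
        # Collapse all digit runs (years, edition numbers) into one feature.
        feats["<num>" if tok.isdigit() else "tok=" + tok] = 1.0
    return feats

def predict_proba(weights, feats):
    """P(conference | URL) under the binary maximum entropy model."""
    z = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.1, l2=0.01):
    """Stochastic gradient ascent on the L2-regularized log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for url, label in examples:
            feats = url_features(url)
            error = label - predict_proba(weights, feats)
            for name, value in feats.items():
                weights[name] += lr * (error * value - l2 * weights[name])
    return dict(weights)

# Toy labeled URLs, invented for illustration (1 = conference site, 0 = other).
TRAIN = [
    ("http://www.sigir2008.org/", 1),
    ("http://www2009.org/program/accepted-papers", 1),
    ("http://conference.example.org/cikm2005/call-for-papers", 1),
    ("http://www.example.com/shop/cart", 0),
    ("http://news.example.com/sports/scores", 0),
    ("http://blog.example.net/recipes/pasta", 0),
]

weights = train(TRAIN)
for url in ("http://www.aaai2004.org/papers", "http://shop.example.com/cart"):
    label = "conference" if predict_proba(weights, url_features(url)) > 0.5 else "other"
    print(url, "->", label)
```

A real system would train on far more URLs and would likely use richer features (character n-grams, path depth, token position), so the six-example training set here only demonstrates the mechanics.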