Evaluation of BILBO reference parsing in digital humanities via a comparison of different tools

Authors:
Young-Min Kim;Patrice Bellot;Jade Tavernier;Elodie Faath;Marin Dacos
Affiliations:
LIA, University of Avignon, Avignon, France;LSIS, Aix-Marseille University, Marseille, France;LIA, University of Avignon, Avignon, France;CLEO, Centre for Open Electronic Publishing, Marseille, France;CLEO, Centre for Open Electronic Publishing, Marseille, France
Venue:
Proceedings of the 2012 ACM symposium on Document engineering
Year:
2012

Abstract

Automatic bibliographic reference annotation involves tokenizing references and identifying their fields. Recent methods tackle this problem with machine learning techniques such as Conditional Random Fields. However, state-of-the-art methods typically train and evaluate their systems on well-structured data with a simple format, such as the bibliography at the end of a scientific article. This is one reason why references that depart from a regular format are parsed poorly. In our previous work, we established a standard for tokenization and feature selection on less formulaic data such as notes. In this paper, we compare our system, BILBO, with other popular online reference parsing tools on new data from an entirely different source. BILBO is built on our own corpora, extracted and annotated from real-world data: digital humanities articles from the Revues.org site of OpenEdition (90% in French). The robustness of the BILBO system allows language-independent tagging. We expect that this first evaluation will motivate the development of other efficient techniques for scattered and less formulaic bibliographic references.
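
To make the tokenization and feature-selection step concrete, here is a minimal Python sketch of how a reference string might be split into tokens and turned into per-token feature dictionaries, the kind of input a Conditional Random Field tagger usually consumes. The tokenizer pattern, the feature names, and the sample reference are all assumptions made for illustration. They are not BILBO's actual tokenization standard or feature set.

```python
import re

# Hypothetical tokenizer (an assumption, not BILBO's rules): a token is either
# a word, optionally with internal hyphens or apostrophes, or a single
# punctuation character.
TOKEN_RE = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

# Hypothetical year pattern covering 1500-2099.
YEAR_RE = re.compile(r"(?:1[5-9]|20)\d{2}")


def tokenize(reference):
    """Split a raw reference string into word and punctuation tokens."""
    return TOKEN_RE.findall(reference)


def token_features(tokens, i):
    """Build an illustrative feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_digits": tok.isdigit(),
        "is_year_like": bool(YEAR_RE.fullmatch(tok)),
        "is_punct": not any(ch.isalnum() for ch in tok),
        # Neighbouring tokens give the tagger local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def featurize(reference):
    """Return the tokens of a reference and one feature dict per token."""
    tokens = tokenize(reference)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    # Invented sample reference in the style of a footnote.
    sample = "Dupont, J., Histoire de Marseille, Paris, 1998, p. 12."
    toks, feats = featurize(sample)
    for tok, feat in zip(toks, feats):
        print(tok, feat)
```

A sequence labeler would then assign one field label (author, title, date, and so on) to each feature dictionary. Keeping punctuation as separate tokens is a deliberate choice in this sketch, since separators often signal field boundaries in loosely formatted references.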