CEBBIP: a parser of bibliographic information in Chinese electronic books

Authors:
Liangcai Gao;Zhi Tang;Xiaofan Lin
Affiliations:
Institute of Computer Science and Technology of Peking University, Beijing, China;Institute of Computer Science and Technology of Peking University, Beijing, China;Vobile Inc., Santa Clara, USA
Venue:
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Year:
2009

Abstract

Bibliographic information is essential for many digital library applications, such as citation analysis, academic search and topic discovery, and bibliographic data extraction has attracted a great deal of attention in recent years. In this paper, we address the problem of automatically extracting bibliographic data from Chinese electronic books and propose a tool called CEBBIP for the task. The tool consists of three main subsystems: data preprocessing, data parsing and data postprocessing. In the data preprocessing subsystem, the tool adopts a rule-based method to locate citation data in a book and to segment it into citation strings, one for each referenced work. In the data parsing subsystem, a learning-based approach, Conditional Random Fields (CRF), is employed to parse the citation strings. Finally, the tool takes advantage of the local format consistency intrinsic to a document to enhance citation data segmentation and parsing through clustering techniques. CEBBIP has been used in a commercial e-book production system. Experimental results show that CEBBIP achieves high precision. More specifically, exploiting the document's intrinsic local format consistency clearly improves the accuracy of citation data segmentation and parsing.
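
The abstract does not give the segmentation rules, so the following is only a minimal sketch of how rule-based segmentation might lean on local format consistency. It is not the CEBBIP implementation: the marker patterns, the function names `marker_style` and `segment_citations`, and the use of a simple majority vote in place of the clustering step are all assumptions made for illustration. The idea is that one reference list usually uses a single numbering style, so a line that happens to start with another style (for example a wrapped line beginning with a year) is treated as a continuation rather than as a new citation.

```python
import re
from collections import Counter

# Hypothetical marker patterns; the paper does not list its actual rules.
MARKER_PATTERNS = {
    "bracket": re.compile(r"^\[(\d+)\]\s*"),  # [1] ...
    "paren": re.compile(r"^\((\d+)\)\s*"),    # (1) ...
    "dot": re.compile(r"^(\d+)\.\s+"),        # 1. ...
}


def marker_style(line):
    """Return the name of the marker style that opens the line, or None."""
    for name, pattern in MARKER_PATTERNS.items():
        if pattern.match(line):
            return name
    return None


def segment_citations(lines):
    """Split the lines of a reference section into citation strings.

    The dominant marker style is chosen by majority vote (a stand-in for
    the clustering mentioned in the abstract), and only lines opening
    with that style start a new citation.
    """
    stripped = [ln.strip() for ln in lines if ln.strip()]
    styles = Counter(s for s in map(marker_style, stripped) if s)
    if not styles:
        # No recognizable markers: return the text as one unsegmented string.
        return [" ".join(stripped)] if stripped else []
    pattern = MARKER_PATTERNS[styles.most_common(1)[0][0]]

    citations = []
    for line in stripped:
        match = pattern.match(line)
        if match:
            # A line with the dominant marker starts a new citation.
            citations.append(line[match.end():])
        elif citations:
            # Anything else continues the current citation.
            citations[-1] += " " + line
        # Lines before the first marker (e.g. a heading) are skipped.
    return citations


sample = [
    "References",
    "[1] Author A. First placeholder title,",
    "2001. Placeholder Press.",
    "[2] Author B. Second placeholder title, 2005.",
]
for citation in segment_citations(sample):
    print(citation)
```

In the sample, the wrapped line `2001. Placeholder Press.` matches the "dot" pattern, but because the bracket style dominates it is appended to the first citation instead of starting a new one. Joining with a space suits the Latin-script sample; Chinese text would more likely be joined without one.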