CEBBIP: a parser of bibliographic information in Chinese electronic books

  • Authors: Liangcai Gao; Zhi Tang; Xiaofan Lin

  • Affiliations: Institute of Computer Science and Technology of Peking University, Beijing, China; Institute of Computer Science and Technology of Peking University, Beijing, China; Vobile Inc., Santa Clara, USA

  • Venue: Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
  • Year: 2009

Abstract

Bibliographic information is essential for many digital library applications, such as citation analysis, academic search and topic discovery, and bibliographic data extraction has attracted a great deal of attention in recent years. In this paper, we address the problem of automatically extracting bibliographic data from Chinese electronic books and propose a tool called CEBBIP for the task. The tool consists of three main subsystems: data preprocessing, data parsing and data postprocessing. In the preprocessing subsystem, the tool adopts a rule-based method to locate citation data in a book and to segment it into the citation strings of individual references. In the parsing subsystem, a learning-based approach, Conditional Random Fields (CRF), is employed to parse the citation strings. Finally, in the postprocessing subsystem, the tool takes advantage of the local format consistency intrinsic to a document to improve citation segmentation and parsing through clustering techniques. CEBBIP has been used in a commercial e-book production system. Experimental results show that CEBBIP achieves very high precision. More specifically, exploiting the document's intrinsic local format consistency clearly improves the accuracy of citation segmentation and parsing.
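
The rule-based segmentation step mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not give CEBBIP's actual rules, so everything here is an assumption made for illustration: the function name segment_citations, the single rule that a new citation starts at a line beginning with a marker such as "[1]", "1." or "1、", and the made-up sample entries. Lines without such a marker are treated as continuations of the previous citation.

```python
import re
from typing import List

# Hypothetical rule (not taken from the paper): a citation entry starts
# with a bracketed number "[12]" or a number followed by "." or "、".
ENTRY_START = re.compile(r"^\s*(?:\[\d+\]|\d+[.\u3001])\s*")


def segment_citations(block: str) -> List[str]:
    """Split a reference block into individual citation strings."""
    citations: List[str] = []
    for raw_line in block.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between entries
        if ENTRY_START.match(line) or not citations:
            # New entry: drop the leading marker, keep the citation text.
            citations.append(ENTRY_START.sub("", line, count=1))
        else:
            # Continuation line: append to the current entry.
            # Joining with a space suits Latin text; Chinese text would
            # usually be joined without one.
            citations[-1] += " " + line
    return citations


if __name__ == "__main__":
    # Made-up sample entries, for illustration only.
    sample = (
        "[1] Author A. Title one.\n"
        "    Publisher One, 2001.\n"
        "[2] Author B. Title two. Publisher Two, 2002.\n"
    )
    for citation in segment_citations(sample):
        print(citation)
```

A single marker rule like this breaks on books that mix numbering styles or wrap a line so that it begins with a number such as a year followed by a period. Irregularities of this kind are what the clustering-based postprocessing described in the abstract is meant to catch.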