Header metadata extraction from semi-structured documents using template matching

  • Authors:
  • Zewu Huang;Hai Jin;Pingpeng Yuan;Zongfen Han

  • Affiliations:
  • Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan, China (all four authors)

  • Venue:
  • OTM'06 Proceedings of the 2006 international conference on On the Move to Meaningful Internet Systems: AWeSOMe, CAMS, COMINF, IS, KSinBIT, MIOS-CIAO, MONET - Volume Part II
  • Year:
  • 2006

Abstract

With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for extracting header metadata from semi-structured documents stored in PDF. In our approach, templates are defined, and the document is treated as a sequence of strings with formatting information. The templates guide a finite state automaton (FSA) to extract the header metadata of papers. Test results indicate that our approach extracts metadata effectively, requires no training cost, and is applicable to some special situations. This approach can effectively assist automatic index creation in many fields, such as digital libraries, information retrieval, and data mining.
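
The abstract does not give the template format or the automaton's transition rules, so the following Python sketch is only one plausible reading of "templates guide an FSA over strings with format". Everything in it is an assumption made for illustration: the `FormattedLine` type, the font-size thresholds in `HYPOTHETICAL_TEMPLATE`, and the rule that the automaton may stay in its current state or move forward, but never backward. It also assumes the PDF has already been converted to text lines with font information by some other tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FormattedLine:
    """A string together with its format (hypothetical representation)."""
    text: str
    font_size: float
    bold: bool = False


# A template is an ordered list of (field name, acceptance predicate) pairs.
# Each pair plays the role of one state of the automaton.
Template = List[Tuple[str, Callable[[FormattedLine], bool]]]

# Assumed thresholds for one imaginary paper layout, not taken from the paper.
HYPOTHETICAL_TEMPLATE: Template = [
    ("title", lambda l: l.font_size >= 14),
    ("authors", lambda l: 11 <= l.font_size < 14),
    ("affiliation",
     lambda l: l.font_size < 11 and not l.text.lower().startswith("abstract")),
]


def extract_header(lines: List[FormattedLine],
                   template: Template) -> Dict[str, str]:
    """Run a forward-only automaton whose states come from the template."""
    fields: Dict[str, List[str]] = {name: [] for name, _ in template}
    state = 0
    for line in lines:
        if not line.text.strip():
            continue  # blank lines carry no metadata
        # Stay in the current state or advance to the first later state
        # whose predicate accepts this line.
        for nxt in range(state, len(template)):
            name, accepts = template[nxt]
            if accepts(line):
                fields[name].append(line.text.strip())
                state = nxt
                break
        else:
            break  # no remaining state accepts the line: the header has ended
    return {name: " ".join(parts) for name, parts in fields.items()}


SAMPLE_LINES = [
    FormattedLine("Header metadata extraction from semi-structured documents", 16, True),
    FormattedLine("using template matching", 16, True),
    FormattedLine("Zewu Huang, Hai Jin, Pingpeng Yuan, Zongfen Han", 12),
    FormattedLine("Cluster and Grid Computing Lab, Huazhong University of Science and Technology", 10),
    FormattedLine("Abstract", 10, True),
]

if __name__ == "__main__":
    for field, value in extract_header(SAMPLE_LINES, HYPOTHETICAL_TEMPLATE).items():
        print(f"{field}: {value}")
```

Because the rules live in the template rather than in learned parameters, supporting a different layout in this sketch means writing another template, which matches the abstract's claim that no training is needed.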