Header metadata extraction from semi-structured documents using template matching

Authors:
Zewu Huang;Hai Jin;Pingpeng Yuan;Zongfen Han
Affiliations:
Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan, China;Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan, China;Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan, China;Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan, China
Venue:
OTM'06 Proceedings of the 2006 international conference on On the Move to Meaningful Internet Systems: AWeSOMe, CAMS, COMINF, IS, KSinBIT, MIOS-CIAO, MONET - Volume Part II
Year:
2006

Citing 10
Cited 1

Digital libraries and knowledge disaggregation: the use of journal article components

Proceedings of the third ACM conference on Digital libraries
Knowledge-based metadata extraction from PostScript files

DL '00 Proceedings of the fifth ACM conference on Digital libraries
A statistical learning model of text classification for support vector machines

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Automatic metadata generation & evaluation

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Maximum Entropy Markov Models for Information Extraction and Segmentation

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Digital Document Metadata in Organizations: Roles, Analytical Approaches, and Future Research Directions

HICSS '98 Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences - Volume 2
Automatic document metadata extraction using support vector machines

Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Federating heterogeneous digital libraries by metadata harvesting

Federating heterogeneous digital libraries by metadata harvesting
A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials

DIAL '04 Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04)
Automatic extraction of titles from general documents using machine learning

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries

Automatic metadata mining from multilingual enterprise content

Web Semantics: Science, Services and Agents on the World Wide Web

Abstract

With the recent proliferation of documents, automatic metadata extraction from documents has become an important task. In this paper, we propose a novel template-matching-based method for header metadata extraction from semi-structured documents stored in PDF. In our approach, templates are defined, and the document is treated as a sequence of formatted strings. The templates guide a finite state automaton (FSA) that extracts the header metadata of papers. The test results indicate that our approach can extract metadata effectively, without any training cost, and that it is applicable to some special situations. This approach can effectively assist automatic index creation in many fields, such as digital libraries, information retrieval, and data mining.
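
The abstract does not give the template format or the automaton's transitions, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each line of the document is available as a string with format attributes, and that a template is an ordered list of states, each accepting lines by format. The names `Line`, `State`, `TEMPLATE`, and `extract_header`, and all font-size thresholds, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Line:
    """One formatted string from the document (hypothetical representation)."""
    text: str
    font_size: float
    bold: bool = False


@dataclass
class State:
    """One template state: a field name plus a test on a line's format."""
    field: str
    accepts: Callable[[Line], bool]


# Hypothetical template for one layout: title, then authors, then abstract.
# The thresholds are illustrative assumptions, not values from the paper.
TEMPLATE: List[State] = [
    State("title", lambda ln: ln.font_size >= 16),
    State("authors", lambda ln: 11 <= ln.font_size < 16 and not ln.bold),
    State("abstract", lambda ln: ln.font_size < 11),
]


def extract_header(lines: List[Line], template: List[State]) -> Dict[str, str]:
    """Run a linear automaton over the lines, guided by the template."""
    fields: Dict[str, List[str]] = {s.field: [] for s in template}
    current = 0
    for ln in lines:
        if template[current].accepts(ln):
            # Stay in the current state and keep collecting its field.
            fields[template[current].field].append(ln.text)
        elif current + 1 < len(template) and template[current + 1].accepts(ln):
            # Advance to the next state when the line fits its format.
            current += 1
            fields[template[current].field].append(ln.text)
        else:
            # The line fits neither state: treat it as the end of the header.
            break
    return {name: " ".join(parts) for name, parts in fields.items()}


lines = [
    Line("Header metadata extraction from semi-structured", 18.0),
    Line("documents using template matching", 18.0),
    Line("Zewu Huang, Hai Jin, Pingpeng Yuan, Zongfen Han", 12.0),
    Line("With the recent proliferation of documents,", 9.0),
    Line("1 Introduction", 14.0, bold=True),
]
result = extract_header(lines, TEMPLATE)
print(result["title"])
print(result["authors"])
```

Because the states are plain format tests, supporting another layout in this sketch means writing another template rather than retraining a model, which is in the spirit of the "without any training cost" claim.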