A study on information extraction from PDF files

Authors:
Fang Yuan;Bo Liu;Ge Yu
Affiliations:
College of Mathematics and Computer Science, Hebei University, Baoding, Hebei, P.R. China;College of Mathematics and Computer Science, Hebei University, Baoding, Hebei, P.R. China;College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, P.R. China
Venue:
ICMLC'05 Proceedings of the 4th international conference on Advances in Machine Learning and Cybernetics
Year:
2005

Citing (2):
A hierarchical approach to wrapper induction. Proceedings of the third annual conference on Autonomous Agents
Wrapper induction: efficiency and expressiveness. Artificial Intelligence - Special issue on Intelligent internet systems

Cited by (2):
Job profiling in high performance printing. Proceedings of the 9th ACM symposium on Document engineering
Job profiling and queue management in high performance printing. Computer Science - Research and Development

Abstract

Portable Document Format (PDF) is increasingly recognized as a common format for electronic documents. Managing and indexing PDF files first requires extracting information from them. This paper describes an approach to that extraction. The key idea is to transform the text parsed from PDF files into semi-structured information by injecting additional uniform tags. An extensible rule set is built on the tags and other knowledge. Guided by these rules, a pattern-matching algorithm based on a tree model extracts the required information. An experiment showed that the method is effective.
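
The abstract does not give the paper's tag vocabulary, rule format, or matching algorithm, so the following is only a minimal Python sketch of the general idea: tag parsed text lines uniformly, hold them in a tree, and extract fields by matching on tags. It assumes the text has already been parsed out of the PDF. The names `Node`, `TAG_RULES`, `inject_tags`, `find_all`, and `extract`, as well as the two example rules, are hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One tagged element of the semi-structured document tree."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# Hypothetical, extensible rule set: (tag name, pattern a line must match).
# The paper's real rules are not described in the abstract.
TAG_RULES = [
    ("AUTHORS", re.compile(r"^Authors?:\s*(.+)$", re.IGNORECASE)),
    ("YEAR", re.compile(r"^Year:\s*(\d{4})$", re.IGNORECASE)),
]


def inject_tags(lines: List[str]) -> Node:
    """Wrap each non-empty text line in a tag and attach it to a root node."""
    root = Node("DOC")
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        for tag, pattern in TAG_RULES:
            match = pattern.match(line)
            if match:
                root.children.append(Node(tag, match.group(1)))
                break
        else:
            # No rule matched: keep the line under a generic tag.
            root.children.append(Node("TEXT", line))
    return root


def find_all(node: Node, tag: str) -> List[Node]:
    """Depth-first search for every node carrying the given tag."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found


def extract(root: Node, wanted: List[str]) -> Dict[str, List[str]]:
    """Collect the text of all nodes whose tag is in the wanted list."""
    return {tag: [n.text for n in find_all(root, tag)] for tag in wanted}
```

For example, `extract(inject_tags(["Authors: Fang Yuan; Bo Liu", "Year: 2005"]), ["AUTHORS", "YEAR"])` collects the author string and the year under their tags. New rules can be appended to `TAG_RULES` without changing the matching code, which is one simple way to keep a rule set extensible.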