Docear's PDF inspector: title extraction from PDF files
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Extracting titles from a PDF's full text is an important task in information retrieval, since the title helps identify the PDF. Existing approaches apply complex and computationally expensive machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule-based heuristic that uses style information (font size) to identify a PDF's title. In a first experiment we show that, in an 'academic search engine' scenario, this heuristic delivers better results (77.9% accuracy) than a support vector machine by CiteSeer (69.4% accuracy) and better run times (8:19 minutes vs. 57:26 minutes).
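The font-size heuristic described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the text lines of the first page and their font sizes have already been extracted by some PDF library, and the names `TextLine`, `guess_title`, and `min_length` are hypothetical. The rule shown here (take the first run of consecutive lines set in the largest font) is one plausible reading of "considers font size", not the exact rule set from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextLine:
    """One line of text from a PDF's first page (hypothetical structure)."""
    text: str
    font_size: float  # in points; assumed to be extracted beforehand


def guess_title(lines: List[TextLine], min_length: int = 3) -> Optional[str]:
    """Return the first run of lines set in the largest font, or None.

    min_length is an assumed filter that skips very short fragments
    such as page numbers or stray symbols.
    """
    candidates = [ln for ln in lines if len(ln.text.strip()) >= min_length]
    if not candidates:
        return None

    largest = max(ln.font_size for ln in candidates)

    title_parts: List[str] = []
    for ln in candidates:
        if ln.font_size == largest:
            title_parts.append(ln.text.strip())
        elif title_parts:
            # The largest-font run has ended; ignore anything after it.
            break
    return " ".join(title_parts)


if __name__ == "__main__":
    page = [
        TextLine("Conference header", 9.0),
        TextLine("Docear's PDF Inspector:", 18.0),
        TextLine("Title Extraction from PDF Files", 18.0),
        TextLine("Author Name", 11.0),
    ]
    print(guess_title(page))
```

A rule like this needs only one pass over the first page's lines and no trained model, which fits the abstract's point about lower run times compared with machine learning approaches.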