Intelligent Text Extraction from PDF Documents

Authors:
Tamir Hassan; Robert Baumgartner
Affiliations:
Vienna University of Technology, Austria; Vienna University of Technology, Austria
Venue:
CIMCA '05 Proceedings of the International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce Vol-2 (CIMCA-IAWTIC'06) - Volume 02
Year:
2005

Abstract

In recent years, PDF has become the de facto standard for the exchange of print-oriented documents on the Web. This includes many business documents such as financial reports, newsletters and patent applications, and many commercial applications require data to be extracted from these documents and processed by computer systems. A number of products currently on the market navigate, extract and transform data from HTML pages, a process known as wrapping. One such methodology is Lixto, a product of research at our institute. However, none of these products is currently able to work with PDF files. We are investigating this possibility as part of the NEXTWRAP project. This paper describes our work in progress and details some of the low-level page segmentation techniques that we have investigated.