Metadata Extraction from Books with Facts about Austria

Authors:
Petra Korica-Pehserl;Hermann Maurer
Affiliations:
Institute for Information Systems and Computer Media, Inffeldgasse 16c, 8010 Graz, Austria;Institute for Information Systems and Computer Media, Inffeldgasse 16c, 8010 Graz, Austria
Venue:
Proceedings of International Conference on Information Integration and Web-based Applications & Services
Year:
2013

Abstract

Digitized fact books are valuable sources of knowledge, and full-text search is a powerful tool for accessing it. For general queries, however, full-text search often returns too many results. We therefore propose extracting metadata for each page and allowing users to search for pages by their metadata as an alternative to full-text search. Given the volume of scanned data (high-quality image scans), this extraction clearly cannot be done manually. As it turns out, although books share some common aspects, different books often need to be treated differently. In particular, we distinguish two kinds of books: lexicons (dictionaries), in which items are arranged alphabetically, and other books that describe various topics in a more narrative style. In this paper we describe in detail the approach we used on different fact books and share the lessons we learned.
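
To illustrate the contrast between full-text search and search over per-page metadata, here is a minimal Python sketch. It is not the authors' implementation: the `Page` record, the `keywords` field, the helper functions, and the toy pages are all hypothetical, and the metadata is assumed to have been extracted already.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Page:
    """One scanned page (hypothetical structure, for illustration only)."""
    book: str
    number: int
    text: str                                        # OCR full text of the page
    keywords: set[str] = field(default_factory=set)  # assumed, already-extracted metadata


def build_metadata_index(pages):
    """Map each lower-cased metadata keyword to the pages tagged with it."""
    index = defaultdict(list)
    for page in pages:
        for keyword in page.keywords:
            index[keyword.lower()].append(page)
    return index


def full_text_search(pages, query):
    """Return every page whose text contains the query as a substring."""
    q = query.lower()
    return [p for p in pages if q in p.text.lower()]


def metadata_search(index, query):
    """Return only the pages whose metadata contains the query keyword."""
    return index.get(query.lower(), [])


# Toy data: placeholder texts, not taken from any real book.
pages = [
    Page("Lexicon A", 12, "Graz: entry text about the city.", {"Graz"}),
    Page("Lexicon A", 40, "Leoben: entry text that mentions Graz in passing.", {"Leoben"}),
    Page("Narrative Book B", 7, "Chapter on railways; mentions Vienna and Graz.", {"Vienna", "Railway"}),
]

index = build_metadata_index(pages)
full_hits = full_text_search(pages, "graz")   # every page mentioning the word
meta_hits = metadata_search(index, "graz")    # only the page that is about it
```

In this toy setup the full-text query matches all three pages, while the metadata query returns only the page tagged with the keyword, which is the kind of narrowing the abstract argues for.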