Digitized fact books are valuable sources of knowledge, and full-text search is a powerful tool for accessing that knowledge. However, it often returns too many results for general queries. We therefore propose an approach that extracts relevant metadata for each page and lets users search for pages by their metadata as an alternative to full-text search. Given the size of the scanned data (high-quality image scans), this extraction clearly cannot be done manually. As it turns out, although there are some common aspects, different books often need to be treated differently. In particular, we distinguish two kinds of books: lexicons (dictionaries), in which items are arranged alphabetically, and other books that describe various topics in a more narrative style. In this paper, we describe in detail the approach we used on different fact books and share the lessons we learned.