Optimization issues in inverted index-based entity annotation

Authors:
Ganesh Ramakrishnan;Sachindra Joshi;Sanjeet Khaitan;Sreeram Balakrishnan
Affiliations:
IBM India Research Lab, New Delhi, India;IBM India Research Lab, New Delhi, India;InfoSpace Inc. Bangalore, India;IBM Software Group, San Jose, CA, United States
Venue:
Proceedings of the 3rd international conference on Scalable information systems
Year:
2008

Citing 17

Introduction to algorithms

Query evaluation: strategies and optimizations
Information Processing and Management: an International Journal

Fast text searching for regular expressions or automaton searching on tries
Journal of the ACM (JACM)

Cost-based optimization of decision support queries using transient-views
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Managing gigabytes (2nd ed.): compressing and indexing documents and images

Efficient and extensible algorithms for multi query optimization
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data

Approximating the smallest grammar: Kolmogorov complexity in natural models
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing

Database Systems Concepts

Efficient phrase querying with an auxiliary index
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval

Information Extraction: Techniques and Challenges
SCIE '97 International Summer School on Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology

SemTag and seeker: bootstrapping the semantic web via automated semantic annotation
WWW '03 Proceedings of the 12th international conference on World Wide Web

Efficient query evaluation using a two-level retrieval process
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management

SRI International FASTUS system: MUC-6 test results and analysis
MUC6 '95 Proceedings of the 6th conference on Message understanding

Optimization strategies for complex queries
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

Optimizing scoring functions and indexes for proximity search in type-annotated corpora
Proceedings of the 15th international conference on World Wide Web

Avatar semantic search: a database approach to information retrieval
Proceedings of the 2006 ACM SIGMOD international conference on Management of data

Entity annotation based on inverse index operations
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Cited 1

SystemT: an algebraic approach to declarative information extraction
ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Abstract

Entity annotation is emerging as a key enabling requirement for search based on deeper semantics: for example, a search on 'John's address' that returns matches to all entities annotated as an address that co-occur with 'John'. A dominant paradigm adopted by rule-based named entity annotators is to annotate one document at a time. The cost of this approach grows linearly with the number of documents and with the cost of annotating each document, which can be prohibitive for large document corpora. A recently proposed alternative paradigm for rule-based entity annotation [16] operates on the inverted index of a document collection and achieves an order-of-magnitude speed-up over its document-based counterpart. In addition, the index-based approach permits collection-level optimization of the order of index operations required for the annotation process. It is this aspect that is explored in this paper. We develop a polynomial-time algorithm that, based on estimated cost, can optimally select between different logically equivalent evaluation plans for a given rule. Additionally, we prove that this problem becomes NP-hard when the optimization has to be performed over multiple rules, and we provide effective heuristics for handling this case. Our empirical evaluations show a speed-up factor of up to 2 over the baseline system without optimizations.
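
The idea of choosing among logically equivalent evaluation plans by estimated cost can be illustrated with a small sketch. This is not the algorithm from the paper, which the abstract does not spell out. It assumes a rule that matches a sequence of adjacent terms, where each term has a posting list in the inverted index, and it assumes a simple cost model invented for this example: merging two lists costs the sum of their lengths, and the merged list is estimated to be as long as the shorter input. The names `best_plan` and `sizes` are hypothetical. Under these assumptions, an interval dynamic program finds the cheapest merge order in polynomial time.

```python
from functools import lru_cache


def best_plan(sizes):
    """Return (estimated_cost, plan) for merging adjacent posting lists.

    sizes[i] is the estimated posting-list length of the i-th rule term.
    Assumed cost model (hypothetical, not taken from the paper): merging
    two lists costs the sum of their lengths, and the result is estimated
    to be as long as the shorter input.
    """
    if not sizes:
        raise ValueError("a rule needs at least one term")
    n = len(sizes)

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Best plan for terms i..j inclusive, as a tuple of
        # (estimated cost, estimated result size, plan tree).
        if i == j:
            return 0, sizes[i], i
        best = None
        for k in range(i, j):
            left_cost, left_size, left_plan = solve(i, k)
            right_cost, right_size, right_plan = solve(k + 1, j)
            # Cost of both sub-plans plus the cost of the final merge.
            cost = left_cost + right_cost + left_size + right_size
            candidate = (cost, min(left_size, right_size), (left_plan, right_plan))
            if best is None or candidate[0] < best[0]:
                best = candidate
        return best

    cost, _, plan = solve(0, n - 1)
    return cost, plan


if __name__ == "__main__":
    # Three-term rule with two frequent terms followed by a rare one.
    print(best_plan([1000, 800, 10]))
```

For the three-term example, merging the two frequent terms first has an estimated cost of 2610 under this model, while merging the rare term with its neighbour first costs 1820, so the sketch returns the plan `(0, (1, 2))`. A real optimizer would need cost and selectivity estimates drawn from index statistics rather than the fixed rule assumed here.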