Optimizing scoring functions and indexes for proximity search in type-annotated corpora

Authors:
Soumen Chakrabarti;Kriti Puniyani;Sujatha Das
Affiliations:
IIT Bombay;IIT Bombay;IIT Bombay
Venue:
Proceedings of the 15th international conference on World Wide Web
Year:
2006



Abstract

We introduce a new, powerful class of text proximity queries: find an instance of a given "answer type" (person, place, distance) near "selector" tokens matching given literals or satisfying given ground predicates. An example query is type=distance NEAR Hamburg Munich. Nearness is defined as a flexible, trainable parameterized aggregation function of the selectors, their frequency in the corpus, and their distance from the candidate answer. Such queries provide a key data reduction step for information extraction, data integration, question answering, and other text-processing applications. We describe the architecture of a next-generation information retrieval engine for such applications, and investigate two key technical problems faced in building it. First, we propose a new algorithm that estimates a scoring function from past logs of queries and answer spans. Plugging the scoring function into the query processor gives high accuracy: typically, an answer is found at rank 2-4. Second, we exploit the skew in the distribution over types seen in query logs to reduce the space taken by the new index structures that our system needs. Extensive performance studies with a 10 GB, 2-million-document TREC corpus and several hundred TREC queries show both the accuracy and the efficiency of our system. From an initial 4.3 GB index using 18,000 types from WordNet, we can discard 88% of the space, while inflating query times by a factor of only 1.9. Our final index overhead is only 20% of the total index space needed.
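To make the notion of "nearness" concrete, the following is a minimal sketch of one possible scoring function for a query such as type=distance NEAR Hamburg Munich. It is not the function described in the paper, which is learned from query logs. The exponential decay, the smoothed IDF weight, the nearest-occurrence aggregation, and all names (`idf`, `score_candidates`, `decay_rate`, the toy sentence and its made-up numbers) are assumptions chosen for illustration only.

```python
import math

def idf(term, doc_freq, num_docs):
    # Smoothed inverse document frequency: rarer selectors weigh more.
    # (Assumed weighting; the paper's frequency term may differ.)
    return math.log(1.0 + num_docs / (1.0 + doc_freq.get(term, 0)))

def score_candidates(tokens, candidate_positions, selectors,
                     doc_freq, num_docs, decay_rate=0.5):
    """Score each candidate position by its proximity to selector tokens.

    Each selector contributes idf * exp(-decay_rate * gap), where gap is
    the token distance from the candidate to the selector's nearest
    occurrence. Contributions are summed over selectors.
    """
    lowered = [t.lower() for t in tokens]
    scores = {}
    for c in candidate_positions:
        total = 0.0
        for s in {s.lower() for s in selectors}:
            gaps = [abs(i - c) for i, t in enumerate(lowered) if t == s]
            if gaps:
                total += idf(s, doc_freq, num_docs) * math.exp(-decay_rate * min(gaps))
        scores[c] = total
    return scores

# Toy snippet; the quantities are invented and only serve as candidates.
tokens = "Hamburg to Munich is 780 km by road ; Paris is 900 km away".split()
candidates = [4, 11]  # positions of tokens assumed to be annotated as type=distance
doc_freq = {"hamburg": 50, "munich": 40}  # hypothetical document frequencies
scores = score_candidates(tokens, candidates, ["Hamburg", "Munich"],
                          doc_freq, num_docs=1000)
best = max(scores, key=scores.get)
print(tokens[best], scores)
```

In this sketch the candidate at position 4 outranks the one at position 11 because both selectors occur much closer to it. A trainable version would replace the fixed `decay_rate` and IDF weights with parameters fitted to logged queries and answer spans.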