Compressed data structures for annotated web search

Authors:
Soumen Chakrabarti;Sasidhar Kasturi;Bharath Balakrishnan;Ganesh Ramakrishnan;Rohit Saraf
Affiliations:
IIT Bombay, Mumbai, India;IIT Bombay, Mumbai, India;IIT Bombay, Mumbai, India;IIT Bombay, Mumbai, India;IIT Bombay, Mumbai, India
Venue:
Proceedings of the 21st international conference on World Wide Web
Year:
2012

Abstract

Entity-relationship search at Web scale depends on adding dozens of entity annotations to each of billions of crawled pages and indexing the annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog must support thousands of entity types and tens to hundreds of millions of entities. These targets raise many challenges, chief among them the design of highly compressed in-RAM data structures for spotting and disambiguating entity mentions, and of highly compressed disk-based annotation indices. These data structures cannot be readily built upon standard inverted indices. Here we present a Web-scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15 GB of RAM, and spend about 0.6 core-milliseconds per disambiguation. In contrast, DBpedia Spotlight spends 158 milliseconds, Wikipedia Miner spends 21 milliseconds, and Zemanta spends 9.5 milliseconds. Our annotation indices use ideas from vertical databases to reduce storage by 30%. On 40x8 cores with 40x3 disk spindles, we can annotate and index, in about a day, a billion Web pages with two million entities and 200,000 types from Wikipedia. Index decompression and scan speed are comparable to MG4J.
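
The throughput figures in the abstract can be cross-checked with some rough arithmetic. The sketch below is illustrative only and not part of the paper. It assumes that "about a day" means exactly 24 hours and that "dozens" of annotations means 24 per page; both values are assumptions chosen for the example. The comparison ratios take the quoted per-disambiguation times at face value, although the other systems may have been measured under different conditions.

```python
# Back-of-envelope check of the abstract's throughput figures.
# Figures taken from the abstract:
MACHINES = 40
CORES_PER_MACHINE = 8
PAGES = 1_000_000_000
OUR_MS_PER_DISAMBIGUATION = 0.6  # core-milliseconds
OTHER_SYSTEMS_MS = {
    "DBpedia Spotlight": 158.0,
    "Wikipedia Miner": 21.0,
    "Zemanta": 9.5,
}

# Assumptions NOT from the source, chosen only for illustration:
ASSUMED_SECONDS = 24 * 60 * 60      # "about a day" taken as exactly 24 hours
ASSUMED_ANNOTATIONS_PER_PAGE = 24   # "dozens" taken as two dozen


def core_ms_budget_per_page(pages: int, cores: int, seconds: int) -> float:
    """Core-milliseconds available per page if all cores run for `seconds`."""
    return cores * seconds * 1000.0 / pages


total_cores = MACHINES * CORES_PER_MACHINE
budget = core_ms_budget_per_page(PAGES, total_cores, ASSUMED_SECONDS)
disambiguation_cost = ASSUMED_ANNOTATIONS_PER_PAGE * OUR_MS_PER_DISAMBIGUATION

# Ratio of each quoted time to the 0.6 ms figure.
slowdown = {
    name: ms / OUR_MS_PER_DISAMBIGUATION for name, ms in OTHER_SYSTEMS_MS.items()
}

print(f"total cores: {total_cores}")
print(f"budget per page: {budget:.3f} core-ms")
print(f"disambiguation cost per page: {disambiguation_cost:.1f} core-ms")
for name, ratio in slowdown.items():
    print(f"{name}: about {ratio:.0f}x the per-disambiguation time")
```

Under these assumptions, disambiguation consumes roughly half of the per-page budget, leaving the rest for spotting, parsing, and index construction. At the per-disambiguation times quoted for the other systems, the same assumed workload would not fit within the budget.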