Entity-relationship search at Web scale depends on adding dozens of entity annotations to each of billions of crawled pages, and on indexing those annotations at rates comparable to regular text indexing. Even small entity search benchmarks from TREC and INEX suggest that the entity catalog must support thousands of entity types and tens to hundreds of millions of entities. These targets raise many challenges, chief among them the design of highly compressed in-RAM data structures for spotting and disambiguating entity mentions, and of highly compressed disk-based annotation indices. These data structures cannot be readily built upon standard inverted indices. Here we present a Web-scale entity annotator and annotation index. Using a new workload-sensitive compressed multilevel map, we fit statistical disambiguation models for millions of entities within 1.15 GB of RAM, and spend about 0.6 core-milliseconds per disambiguation. In contrast, DBpedia Spotlight spends 158 milliseconds, Wikipedia Miner spends 21 milliseconds, and Zemanta spends 9.5 milliseconds. Our annotation indices use ideas from vertical databases to reduce storage by 30%. On 40x8 cores with 40x3 disk spindles, we can annotate and index, in about a day, a billion Web pages with two million entities and 200,000 types from Wikipedia. Index decompression and scan speeds are comparable to those of MG4J.
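
The throughput figures above can be sanity-checked with a short back-of-envelope calculation. The sketch below is illustrative only and is not taken from the paper. It assumes 24 annotations per page (one reading of "dozens"), and it treats the per-disambiguation times quoted for the other systems as core-milliseconds, which the abstract does not state. The function and constant names are hypothetical.

```python
# Back-of-envelope check of the throughput figures quoted in the abstract.
# Only the cluster size, page count, and per-disambiguation times come from
# the abstract; everything marked ASSUMPTION is illustrative.

CORES = 40 * 8                    # "40x8 cores"
SECONDS_PER_DAY = 24 * 60 * 60
PAGES = 1_000_000_000             # "a billion Web pages"
ANNOTATIONS_PER_PAGE = 24         # ASSUMPTION: "dozens" read as two dozen

# Milliseconds per disambiguation, as quoted in the abstract.
# ASSUMPTION: the competitor figures are comparable core-milliseconds.
MS_PER_DISAMBIGUATION = {
    "this system": 0.6,
    "DBpedia Spotlight": 158.0,
    "Wikipedia Miner": 21.0,
    "Zemanta": 9.5,
}


def core_ms_budget_per_page(cores: int, pages: int, days: float = 1.0) -> float:
    """Core-milliseconds available per page if the whole job takes `days`."""
    return cores * SECONDS_PER_DAY * days * 1000.0 / pages


def days_for_disambiguation(ms_each: float, cores: int = CORES,
                            pages: int = PAGES,
                            per_page: int = ANNOTATIONS_PER_PAGE) -> float:
    """Cluster-days spent on disambiguation alone, ignoring all other work."""
    core_seconds = pages * per_page * ms_each / 1000.0
    return core_seconds / (cores * SECONDS_PER_DAY)


if __name__ == "__main__":
    budget = core_ms_budget_per_page(CORES, PAGES)
    print(f"per-page budget for a one-day run: {budget:.1f} core-ms")
    for name, ms in MS_PER_DISAMBIGUATION.items():
        print(f"{name}: {days_for_disambiguation(ms):.2f} cluster-days")
```

Under these assumptions, disambiguation at 0.6 core-milliseconds takes roughly half of a one-day budget on 320 cores, leaving the remainder for parsing, spotting, and index construction. The slower per-disambiguation times would push the same job to many days or months on the same hardware.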