Search, exploration and social experience on the Web have recently undergone tremendous changes, with search engines, web portals and social networks offering a different perspective on information discovery and consumption. This new perspective aims to capture user intents and to provide richer, highly connected experiences. The new battleground revolves around technologies for the ingestion, disambiguation and enrichment of entities from a variety of structured and unstructured data sources; we refer to this process as knowledge base synthesis. This paper presents the design, implementation and production deployment of the Web Of Objects (WOO) system, a Hadoop-based platform that tackles these challenges. WOO has been designed and implemented to enable various products in Yahoo! to synthesize knowledge bases (KBs) of entities relevant to their domains. The implementation of WOO we describe is currently used by several Yahoo! properties, including IntoNow, Yahoo! Local, Yahoo! Events and Yahoo! Search. This paper highlights: (i) the challenges that arise in designing, building and operating a platform that handles multi-domain, multi-version and multi-tenant disambiguation of web-scale knowledge bases (hundreds of millions of entities), (ii) the architecture and technical solutions we devised, and (iii) an evaluation on real-world production datasets.