Entity matching for semistructured data in the Cloud

Authors:
Marcus Paradies;Susan Malaika;Jérôme Siméon;Shahan Khatchadourian;Kai-Uwe Sattler
Affiliations:
Ilmenau Univ. of Technology;IBM Software Group;IBM Watson Research;Univ. of Toronto;Ilmenau Univ. of Technology
Venue:
Proceedings of the 27th Annual ACM Symposium on Applied Computing
Year:
2012

References:

Autonomous citation matching. Proceedings of the third annual conference on Autonomous Agents.
Efficient clustering of high-dimensional data sets with application to reference matching. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining.
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery.
Efficient parallel set-similarity joins using MapReduce. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.
Learning-based entity resolution with MapReduce. Proceedings of the third international workshop on Cloud data management.
ChuQL: processing XML with XQuery using Hadoop. Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research.

Abstract

The amount of available information, on the Web or inside companies, is expanding rapidly. With Cloud infrastructure maturing (including tools for parallel data processing, text analytics, clustering, etc.), there is more interest in integrating data to produce higher-value content. New challenges notably include entity matching over large volumes of heterogeneous data. In this paper, we describe an approach for entity matching over large amounts of semistructured data in the Cloud. The approach combines ChuQL [4], a recently proposed extension of XQuery with MapReduce, and a blocking technique for entity matching that can be executed efficiently on top of MapReduce. We illustrate the proposed approach by applying it to automatically extract and enrich references in Wikipedia, and we report on an experimental evaluation of the approach.
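
The general idea of blocking for entity matching on a MapReduce-style pipeline can be illustrated with a minimal in-memory sketch. This is not the paper's ChuQL implementation. The blocking key (first three characters of the title), the token-based Jaccard similarity, the threshold, and all function names are assumptions chosen for illustration only. The map phase assigns each record to a block, the shuffle groups records by block, and the reduce phase compares records only within a block, which avoids comparing every pair in the data set.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three characters of the lowercased title.
    return record["title"].strip().lower()[:3]


def jaccard(a, b):
    # Token-set Jaccard similarity between two strings (an assumed measure).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def map_phase(records):
    # Emit (blocking key, record) pairs, as a mapper would.
    for record in records:
        yield blocking_key(record), record


def shuffle(pairs):
    # Group records by key, standing in for the MapReduce shuffle step.
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def reduce_phase(blocks, threshold=0.5):
    # Compare candidate pairs only inside each block.
    matches = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            if jaccard(left["title"], right["title"]) >= threshold:
                matches.append((left["id"], right["id"]))
    return matches


def match_entities(records, threshold=0.5):
    return reduce_phase(shuffle(map_phase(records)), threshold)


records = [
    {"id": 1, "title": "Efficient parallel set-similarity joins using MapReduce"},
    {"id": 2, "title": "Efficient parallel set similarity joins using MapReduce"},
    {"id": 3, "title": "Autonomous citation matching"},
]
print(match_entities(records))  # [(1, 2)]
```

In this sketch, records 1 and 2 share the block "eff" and are compared, while record 3 lands in its own block and is never compared with the others. A real deployment would run the map and reduce functions on a cluster and would likely use a more robust blocking key.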