A multilevel and domain-independent duplicate detection model for scientific database

Authors:
Jie Song;Yubin Bao;Ge Yu
Affiliations:
Northeastern University, Shenyang, China;Northeastern University, Shenyang, China;Northeastern University, Shenyang, China
Venue:
WAIM'10 Proceedings of the 11th international conference on Web-age information management
Year:
2010


Abstract

Duplicate detection is one of the technical difficulties in the data cleaning area. The data volume of scientific databases is increasing rapidly, which brings new challenges to duplicate detection. In a scientific database, the duplicate detection model should be suitable for massive numerical data, should be independent of the domain, should take the relationships among tables into account, and should focus on the characteristics common to scientific databases. This paper proposes a multilevel duplicate detection model for scientific databases that handles numerical data and general usage well. First, the challenges are identified by analyzing the duplicate-related characteristics of scientific data. Second, the similarity measures of the proposed model are defined. Then the details of the multilevel detection algorithms are introduced. Finally, experiments and applications show that the proposed model is more domain-independent and effective, and is suitable for duplicate detection in scientific databases.