The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Jaccard similarity leads to the Marczewski-Steinhaus topology for information retrieval
Information Processing and Management: an International Journal
Entity Identification in Database Integration
Proceedings of the Ninth International Conference on Data Engineering
Declarative Data Cleaning: Language, Model, and Algorithms
Proceedings of the 27th International Conference on Very Large Data Bases
Efficient Record Linkage in Large Data Sets
DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications
Detecting duplicate objects in XML documents
Proceedings of the 2004 international workshop on Information quality in information systems
Reference reconciliation in complex information spaces
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Relational clustering for multi-type entity resolution
MRDM '05 Proceedings of the 4th international workshop on Multi-relational mining
Scientific data management in the coming decade
ACM SIGMOD Record
Eliminating fuzzy duplicates in data warehouses
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
A Weighting Fuzzy Clustering Algorithm Based on Euclidean Distance
FSKD '08 Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery - Volume 01
Hi-index | 0.00 |
The duplicate detection is one of technical difficulties in data cleaning area. At present, the data volume of scientific database is increasing rapidly, bringing new challenges to the duplicate detection. In the scientific database, the duplicate detection model should be suitable for massive and numerical data, should independent from the domains, should well consider the relationships among tables, and should focus on common grounds of the scientific database. In the paper, a multilevel duplicate detection model for scientific database is proposed, which consider numerical data and general usage well. Firstly, the challenges are propose by analyzing duplicate-related characteristics of scientific data; Secondly, similarity measure of the proposed model are defined; Then the details of multilevel detecting algorithms are introduced; At last, some experiments and applications show that the proposed model is more domain-independent and effective, suitable for duplicate detection in scientific database.