Duplicate identification in deep web data integration

Authors:
Wei Liu;Xiaofeng Meng;Jianwu Yang;Jianguo Xiao
Affiliations:
Institute of Computer Science & Technology, Peking University, Beijing, China;School of Information, Renmin University of China, Beijing, China;Institute of Computer Science & Technology, Peking University, Beijing, China;Institute of Computer Science & Technology, Peking University, Beijing, China
Venue:
WAIM'10 Proceedings of the 11th international conference on Web-age information management
Year:
2010

Abstract

Duplicate identification is a critical step in deep web data integration, and the task generally has to be performed over multiple web databases. However, a matcher customized for one pair of web databases often does not work well for another pair, because presentations vary and schemas differ. It is not practical to build and maintain C(n, 2) = n(n-1)/2 matchers for n web databases. In this paper, we aim to build one universal matcher over multiple web databases in one domain. According to our observation, the similarity on an attribute depends on the similarities on some other attributes, a dependency that existing approaches ignore. Inspired by this, we propose a comprehensive solution to the duplicate identification problem over multiple web databases. Extensive experiments over real web databases in three domains show that the proposed solution addresses this problem effectively.
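
To make the scaling argument in the abstract concrete, the short sketch below counts how many customized matchers are needed if every pair of n web databases gets its own, compared with the single universal matcher the paper aims for. The function name `pairwise_matcher_count` and the sample values of n are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pairwise_matcher_count(n: int) -> int:
    """Matchers needed if each pair of n web databases has its own.

    Hypothetical helper for illustration: this is the binomial
    coefficient C(n, 2) = n * (n - 1) / 2.
    """
    return comb(n, 2)


if __name__ == "__main__":
    # Sample database counts chosen only for illustration.
    for n in (2, 5, 10, 50):
        print(f"n={n}: pairwise matchers={pairwise_matcher_count(n)}, universal matchers=1")
```

The pairwise count grows quadratically in n, while a universal matcher stays at one regardless of how many web databases in the domain are integrated.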