Web trace duplication detection based on context

Authors:
Chang Gao;Xiaoguang Hong;Zhaohui Peng;Hongda Chen
Affiliations:
School of Computer Science and Technology, Shandong University, Jinan, China;School of Computer Science and Technology, Shandong University, Jinan, China and Shandong Dareway Software Co., Ltd., Jinan, China;School of Computer Science and Technology, Shandong University, Jinan, China;School of Computer Science and Technology, Shandong University, Jinan, China
Venue:
WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Year:
2011

Abstract

Data integration has become increasingly important with the rapid spread of the Internet, and the study of entity traces has become increasingly important as a part of it. Entity traces are mainly extracted from text fragments. Because web sources are large in scale, strongly autonomous, and highly redundant, the extracted records contain many duplicates. Handling this duplication often involves semantic features, so traditional integration methods cannot be applied to it directly. In this paper, we propose a web trace duplication detection method based on unsupervised learning and context. We address the problem with a new process that computes the comparison vector between two records based on their context, then acquires sample data automatically, trains classifiers on the sample data, and finally classifies the records.
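
The four steps named in the abstract (comparison vectors, automatic sample acquisition, classifier training, classification) can be illustrated with a minimal Python sketch. The abstract does not specify the similarity measure, the sample selection rule, or the classifier, so everything concrete here is an assumption made for illustration: token-level Jaccard similarity per field, fixed thresholds `hi` and `lo` for picking seed samples, a nearest-centroid classifier, and the hypothetical record fields `name` and `context`. It is not the authors' implementation.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty values are treated as identical
    return len(ta & tb) / len(ta | tb)


def comparison_vector(r1, r2, fields):
    """One similarity score per field for a pair of records."""
    return tuple(jaccard(r1[f], r2[f]) for f in fields)


def select_seeds(vectors, hi=0.8, lo=0.2):
    """Pick training samples without labels (assumed rule): pairs that are
    similar on every field are positives, dissimilar on every field negatives."""
    pos = [v for v in vectors if min(v) >= hi]
    neg = [v for v in vectors if max(v) <= lo]
    return pos, neg


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def detect_duplicates(records, fields, hi=0.8, lo=0.2):
    """Return index pairs classified as duplicates by a nearest-centroid
    classifier trained on the automatically selected seeds."""
    pairs = list(combinations(range(len(records)), 2))
    vectors = {p: comparison_vector(records[p[0]], records[p[1]], fields)
               for p in pairs}
    pos, neg = select_seeds(list(vectors.values()), hi, lo)
    if not pos or not neg:
        raise ValueError("not enough seed samples; adjust hi/lo thresholds")
    c_pos, c_neg = centroid(pos), centroid(neg)
    return [p for p, v in vectors.items()
            if sq_dist(v, c_pos) < sq_dist(v, c_neg)]


# Hypothetical trace records: a name plus the surrounding text as context.
FIELDS = ["name", "context"]
records = [
    {"name": "Zhang Wei",
     "context": "professor shandong university database research"},
    {"name": "Zhang Wei",
     "context": "professor shandong university database research group"},
    {"name": "Li Na",
     "context": "tennis player grand slam champion"},
    {"name": "Wei Zhang",
     "context": "shandong university professor data integration"},
]

if __name__ == "__main__":
    print(detect_duplicates(records, FIELDS))
```

In this toy data, records 0 and 1 form the only positive seed, the pairs involving record 2 form the negative seeds, and the borderline pairs with record 3 are decided by the trained centroids. Comparing all pairs is quadratic in the number of records, so a real web-scale setting would need some form of blocking or candidate filtering before this step.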