A model of uncertainty for near-duplicates in document reference networks

Authors:
Claudia Hess;Michel De Rougemont
Affiliations:
Laboratory for Semantic Information Technology, Bamberg University;LRI, Université Paris-Sud
Venue:
ECDL'07 Proceedings of the 11th European conference on Research and Advanced Technology for Digital Libraries
Year:
2007

Citing 5
Cited 0

Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Finding Near-Replicas of Documents and Servers on the Web

WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Identifying and Filtering Near-Duplicate Documents

COM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Comparing and aggregating rankings with ties

PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Clean Answers over Dirty Databases: A Probabilistic Approach

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering

Quantified Score

Hi-index	0.00

Abstract

We introduce a model of uncertainty in which documents are not uniquely identified in a reference network and some links may be incorrect. The model extends the probabilistic approach from databases to graphs and defines subgraphs with a probability distribution. The answer to a relational query is a distribution of documents; we study how to approximate the ranking of the most likely documents and how to quantify the quality of the approximation. The answer to a function query is a distribution of values, and we take the size of the interval between the minimum and maximum values as a measure of the precision of the answer.
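
The idea of subgraphs carrying a probability distribution can be illustrated with a small possible-worlds sketch. This is not the authors' construction: it assumes, purely for illustration, that each link is correct independently with a given probability, and the document names, the link probabilities, and the helper functions `possible_worlds`, `cited_probability`, and `citation_count_distribution` are all hypothetical. It enumerates every subgraph exactly, so it only works for tiny networks and performs no approximation.

```python
from itertools import product

# Hypothetical uncertain reference network: each directed link
# (citing, cited) has an ASSUMED independent probability of being correct.
uncertain_links = {
    ("a", "c"): 0.9,
    ("b", "c"): 0.5,
    ("b", "d"): 1.0,
}


def possible_worlds(links):
    """Yield (subgraph, probability) for every subset of the links."""
    edges = list(links)
    for keep in product([True, False], repeat=len(edges)):
        prob = 1.0
        world = []
        for edge, kept in zip(edges, keep):
            prob *= links[edge] if kept else 1.0 - links[edge]
            if kept:
                world.append(edge)
        if prob > 0.0:  # skip subgraphs that cannot occur
            yield world, prob


def cited_probability(links):
    """Relational query: probability that each document is cited at least once."""
    probs = {}
    for world, prob in possible_worlds(links):
        for doc in {cited for _, cited in world}:
            probs[doc] = probs.get(doc, 0.0) + prob
    return probs


def citation_count_distribution(links, doc):
    """Function query: distribution of the number of links pointing at doc."""
    dist = {}
    for world, prob in possible_worlds(links):
        count = sum(1 for _, cited in world if cited == doc)
        dist[count] = dist.get(count, 0.0) + prob
    return dist


probs = cited_probability(uncertain_links)
ranking = sorted(probs, key=lambda d: -probs[d])  # most likely documents first

dist = citation_count_distribution(uncertain_links, "c")
interval = (min(dist), max(dist))  # min and max possible values
precision = interval[1] - interval[0]  # smaller interval = more precise answer
```

Here `ranking` orders documents by how likely they are to appear in the answer, and `precision` is the width of the interval between the smallest and largest citation count that document `c` can take across all subgraphs.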