Frequent Itemset Mining for Clustering Near Duplicate Web Documents

Authors:
Dmitry I. Ignatov; Sergei O. Kuznetsov
Affiliations:
Department of Applied Mathematics, Higher School of Economics, Moscow, Russia 105679; Department of Applied Mathematics, Higher School of Economics, Moscow, Russia 105679
Venue:
ICCS '09 Proceedings of the 17th International Conference on Conceptual Structures: Conceptual Structures: Leveraging Semantic Technologies
Year:
2009

Abstract

A vast number of documents on the Web have duplicates, which poses a challenge for developing efficient methods that compute clusters of similar documents. In this paper we use an approach based on computing (closed) sets of attributes having large support (large extent) as clusters of similar documents. The method is tested in a series of computer experiments on large public collections of web documents and compared with other established methods and software, such as biclustering, on the same datasets. The practical efficiency of different algorithms for computing frequent closed sets of attributes is also compared.
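
To make the idea in the abstract concrete, here is a minimal Python sketch of treating closed sets of attributes with large support as clusters. It rests on assumptions that the abstract does not state: documents play the role of attributes, word shingles play the role of objects, and the support of a document set is the number of shingles shared by all of its documents. The names `shingles`, `frequent_closed_docsets`, `min_support`, and `k` are hypothetical, and the brute-force enumeration is only suitable for toy inputs; it is not one of the algorithms compared in the paper.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (illustrative tokenization)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def frequent_closed_docsets(docs, min_support, k=3):
    """Map each frequent closed set of documents to its support.

    Assumed context: objects are shingles, attributes are document ids.
    The support (extent size) of a document set is the number of shingles
    that occur in every document of the set.
    """
    doc_shingles = {d: shingles(t, k) for d, t in docs.items()}
    ids = sorted(docs)
    result = {}
    # Brute force over all document subsets of size >= 2 (exponential).
    for size in range(2, len(ids) + 1):
        for group in combinations(ids, size):
            extent = set.intersection(*(doc_shingles[d] for d in group))
            if len(extent) < min_support:
                continue
            # Closure: every document that contains all shared shingles.
            closure = frozenset(d for d in ids if extent <= doc_shingles[d])
            result[closure] = len(extent)
    return result


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely different text about mining frequent itemsets",
}
clusters = frequent_closed_docsets(docs, min_support=4)
print(clusters)
```

On these toy strings, documents "a" and "b" form a single cluster with six shared shingles, while "c" shares nothing with them and is left out. A practical implementation would replace the subset enumeration with a dedicated frequent closed itemset miner.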