A vast number of documents on the Web have duplicates, which poses a challenge for developing efficient methods for computing clusters of similar documents. In this paper we use an approach based on computing (closed) sets of attributes with large support (large extent) as clusters of similar documents. The method is tested in a series of computer experiments on large public collections of web documents and compared with other established methods and software, such as biclustering, on the same datasets. The practical efficiency of different algorithms for computing frequent closed sets of attributes is also compared.
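
To make the idea of "closed sets of attributes with large support" concrete, the following is a minimal sketch, not the authors' implementation. It assumes one possible reading of the abstract: documents play the role of attributes, and text fragments (hypothetical shingles named `s1`, `s2`, ...) play the role of objects. The function name `frequent_closed_sets`, the toy data, and the naive intersection-based enumeration are all illustrative assumptions; the abstract does not say which algorithms were compared.

```python
def frequent_closed_sets(context, min_support):
    """Return closed attribute sets (size >= 2) with support >= min_support.

    context: dict mapping each object (here, a hypothetical shingle id)
             to the set of attributes (here, document ids) it occurs in.
    A set of attributes is closed exactly when it is an intersection of
    object rows, so we close the family of rows under intersection.
    """
    rows = {frozenset(attrs) for attrs in context.values()}
    closed = set(rows)
    frontier = set(rows)
    while frontier:
        new = set()
        for a in frontier:
            for b in rows:
                c = a & b
                if c and c not in closed:  # skip the empty set
                    new.add(c)
        closed |= new
        frontier = new

    result = {}
    for s in closed:
        # Support = size of the extent: objects whose row contains s.
        support = sum(1 for attrs in context.values() if s <= attrs)
        if support >= min_support and len(s) >= 2:
            result[s] = support
    return result


# Toy data (assumed for illustration): five shingles over four documents.
context = {
    "s1": {"d1", "d2", "d3"},
    "s2": {"d1", "d2"},
    "s3": {"d1", "d2"},
    "s4": {"d3"},
    "s5": {"d3", "d4"},
}

clusters = frequent_closed_sets(context, min_support=2)
for docs, support in clusters.items():
    print(sorted(docs), support)
```

On this toy input, the only closed set of at least two documents that shares at least two shingles is {d1, d2}, which shares three. The naive closure loop is suitable only for small examples; it is meant to illustrate the definition rather than the practical efficiency of any dedicated algorithm.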