Frequent Itemset Mining for Clustering Near Duplicate Web Documents

  • Authors:
  • Dmitry I. Ignatov; Sergei O. Kuznetsov

  • Affiliations:
  • Department of Applied Mathematics, Higher School of Economics, Moscow, Russia 105679; Department of Applied Mathematics, Higher School of Economics, Moscow, Russia 105679

  • Venue:
  • ICCS '09 Proceedings of the 17th International Conference on Conceptual Structures: Conceptual Structures: Leveraging Semantic Technologies
  • Year:
  • 2009

Abstract

A vast number of documents on the Web have duplicates, which poses a challenge for developing efficient methods that compute clusters of similar documents. In this paper we use an approach based on computing (closed) sets of attributes having large support (large extent) as clusters of similar documents. The method is tested in a series of computer experiments on large public collections of web documents and compared to other established methods and software, such as biclustering, on the same datasets. The practical efficiency of different algorithms for computing frequent closed sets of attributes is also compared.
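
The idea of reading frequent closed attribute sets as document clusters can be illustrated with a minimal sketch. It assumes a setup the abstract does not spell out: attributes are documents, objects are text features (for example, shingles), and the support of a document set is the number of features shared by all of its documents. The toy data, the threshold, and the function names `extent`, `closure`, and `frequent_closed_sets` are hypothetical, and the brute-force enumeration below is for illustration only; it is not one of the algorithms compared in the paper.

```python
from itertools import combinations

# Toy formal context (hypothetical data): feature -> documents containing it.
context = {
    "s1": {"d1", "d2"},
    "s2": {"d1", "d2"},
    "s3": {"d1", "d2", "d3"},
    "s4": {"d3", "d4"},
    "s5": {"d3", "d4"},
    "s6": {"d4"},
}


def extent(context, docs):
    """Return the features that occur in every document of `docs`."""
    return {f for f, ds in context.items() if docs <= ds}


def closure(context, docs, all_docs):
    """Return all documents that contain every feature shared by `docs`."""
    shared = extent(context, docs)
    if not shared:
        # No common feature: every document trivially satisfies the condition.
        return set(all_docs)
    return set.intersection(*(context[f] for f in shared))


def frequent_closed_sets(context, min_support):
    """Enumerate closed document sets (size >= 2) with support >= min_support.

    Brute force over all subsets: exponential, suitable for toy input only.
    """
    all_docs = sorted(set().union(*context.values()))
    result = {}
    for k in range(2, len(all_docs) + 1):
        for combo in combinations(all_docs, k):
            docs = set(combo)
            support = len(extent(context, docs))
            if support >= min_support and closure(context, docs, all_docs) == docs:
                result[frozenset(docs)] = support
    return result


clusters = frequent_closed_sets(context, min_support=2)
for docs, support in clusters.items():
    print(sorted(docs), "share", support, "features")
```

On this toy input, the documents d1 and d2 form one cluster (three shared features) and d3 and d4 form another (two shared features); the set {d1, d2, d3} is rejected because only one feature is common to all three, which falls below the assumed threshold.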