Fully automatic cross-associations

Authors:
Deepayan Chakrabarti; Spiros Papadimitriou; Dharmendra S. Modha; Christos Faloutsos
Affiliations:
Carnegie Mellon University, Pittsburgh, PA; Carnegie Mellon University, Pittsburgh, PA; IBM Almaden Research Center, San Jose, CA; Carnegie Mellon University, Pittsburgh, PA
Venue:
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2004

Abstract

Large, sparse binary matrices arise in numerous data mining applications, such as the analysis of market baskets, web graphs, social networks, and co-citations, as well as information retrieval, collaborative filtering, and sparse matrix reordering. Virtually all popular methods for the analysis of such matrices (e.g., k-means clustering, METIS graph partitioning, SVD/PCA and frequent itemset mining) require the user to specify various parameters, such as the number of clusters, number of principal components, number of partitions, and "support." Choosing suitable values for such parameters is a challenging problem.

Cross-association is a joint decomposition of a binary matrix into disjoint row and column groups such that the rectangular intersections of groups are homogeneous. Starting from first principles, we furnish a clear, information-theoretic criterion to choose a good cross-association as well as its parameters, namely, the number of row and column groups. We provide scalable algorithms to approach the optimum. Our algorithm is parameter-free and requires no user intervention. In practice it scales linearly with the problem size, and is thus applicable to very large matrices. Finally, we present experiments on multiple synthetic and real-life datasets, where our method gives high-quality, intuitive results.
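
The abstract does not spell out the criterion, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes that "homogeneous" blocks can be scored by the number of bits needed to encode each block's entries with a binary entropy code. The function names (`binary_entropy`, `block_code_cost`) and the toy matrix are hypothetical. A complete criterion would presumably also charge for describing the groups themselves, which this sketch omits.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; 0 for p in {0, 1}."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def block_code_cost(matrix, row_groups, col_groups):
    """Illustrative cost (assumption, not the paper's exact criterion):
    bits needed to encode the matrix block by block, where each block is
    the intersection of one row group and one column group."""
    k = max(row_groups) + 1  # number of row groups
    l = max(col_groups) + 1  # number of column groups
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            ones[row_groups[i]][col_groups[j]] += value
            size[row_groups[i]][col_groups[j]] += 1
    total = 0.0
    for a in range(k):
        for b in range(l):
            n = size[a][b]
            if n:
                # A block that is nearly all 0s or all 1s costs few bits.
                total += n * binary_entropy(ones[a][b] / n)
    return total


# Toy 4x4 binary matrix with two clean diagonal blocks.
M = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Grouping that matches the block structure: every block is homogeneous.
cost_grouped = block_code_cost(M, [0, 0, 1, 1], [0, 0, 1, 1])

# A single row group and a single column group: one mixed block.
cost_single = block_code_cost(M, [0, 0, 0, 0], [0, 0, 0, 0])
```

On this toy input the grouping that matches the block structure costs fewer bits than the single-group baseline, which is the sense in which a compression-style score can prefer homogeneous blocks. Because this data-only cost never increases when groups are split further, it cannot choose the number of groups by itself; that is why a penalty for describing the grouping would be needed.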