DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining

Authors:
Spiros Papadimitriou; Jimeng Sun
Venue:
ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Year:
2008

Cited by 23

Social influence analysis in large-scale networks
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

DisTec: Towards a Distributed System for Telecom Computing
CloudCom '09 Proceedings of the 1st International Conference on Cloud Computing

Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load
ACM Transactions on Information Systems (TOIS)

Max-cover in map-reduce
Proceedings of the 19th international conference on World wide web

Scalable clustering algorithm for N-body simulations in a shared-nothing cluster
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management

HADI: Mining Radii of Large Graphs
ACM Transactions on Knowledge Discovery from Data (TKDD)

A load-aware scheduler for MapReduce framework in heterogeneous cloud environments
Proceedings of the 2011 ACM Symposium on Applied Computing

A unified representation of web logs for mining applications
Information Retrieval

GBASE: a scalable and general graph management system
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

BitShred: feature hashing malware for scalable triage and semantic analysis
Proceedings of the 18th ACM conference on Computer and communications security

A Map-Reduce Based Framework for Heterogeneous Processing Element Cluster Environments
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)

Personalized news recommendation: a review and an experimental investigation
Journal of Computer Science and Technology - Special issue on Community Analysis and Information Recommendation

Unsupervised sparse matrix co-clustering for marketing and sales intelligence
PAKDD'12 Proceedings of the 16th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I

MapReduce algorithms for big data analysis
Proceedings of the VLDB Endowment

gbase: an efficient analysis platform for large graphs
The VLDB Journal — The International Journal on Very Large Data Bases

Multimedia Applications and Security in MapReduce: Opportunities and Challenges
Concurrency and Computation: Practice & Experience

Simulation of database-valued markov chains using SimSQL
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Big graph mining: algorithms and discoveries
ACM SIGKDD Explorations Newsletter

CopyCatch: stopping group attacks by spotting lockstep behavior in social networks
Proceedings of the 22nd international conference on World Wide Web

Prolog programming with a map-reduce parallel construct
Proceedings of the 15th Symposium on Principles and Practice of Declarative Programming

The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)

A fast algorithm for clustering with mapreduce
ISNN'13 Proceedings of the 10th international conference on Advances in Neural Networks - Volume Part I

Achieving Accountable MapReduce in cloud computing
Future Generation Computer Systems

Abstract

Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, and graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing and co-clustering. We develop DisCo using Hadoop, an open-source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.
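
To make the pairing of Map-Reduce and co-clustering concrete, here is a minimal single-process sketch in Python. It is not the DisCo implementation and does not use Hadoop; it only illustrates one building block that co-clustering methods commonly need: given current row-group and column-group assignments, sum the matrix entries that fall in each (row group, column group) block. The function names (map_cell, reduce_block, run_map_reduce), the toy data, and the in-memory shuffle are all assumptions made for illustration.

```python
from collections import defaultdict


def map_cell(cell, row_labels, col_labels):
    """Map step: tag one nonzero matrix cell with the block it falls in.

    cell is (row index, column index, value); the emitted key is the
    (row group, column group) pair under the current assignments.
    """
    i, j, value = cell
    yield (row_labels[i], col_labels[j]), value


def reduce_block(key, values):
    """Reduce step: sum all values that share one block key."""
    return key, sum(values)


def run_map_reduce(cells, row_labels, col_labels):
    """Hypothetical in-memory driver standing in for a real Map-Reduce engine."""
    shuffled = defaultdict(list)  # plays the role of the shuffle/sort phase
    for cell in cells:
        for key, value in map_cell(cell, row_labels, col_labels):
            shuffled[key].append(value)
    return dict(reduce_block(key, values) for key, values in shuffled.items())


# Toy 4x4 binary matrix stored as (row, column, value) triples.
cells = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (2, 2, 1), (3, 2, 1), (3, 3, 1)]
row_labels = [0, 0, 1, 1]  # assumed current row-group assignment
col_labels = [0, 0, 1, 1]  # assumed current column-group assignment

block_sums = run_map_reduce(cells, row_labels, col_labels)
print(block_sums)
```

In an iterative co-clustering scheme, block statistics like these would typically be recomputed after each reassignment of rows or columns; how DisCo itself organizes those passes is described in the paper, not in this sketch.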