Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bioinformatics, and graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing and co-clustering. We develop DisCo using Hadoop, an open-source Map-Reduce implementation. We show that DisCo scales well and can efficiently process and analyze extremely large datasets (up to several hundred gigabytes) on commodity hardware.
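To make the pairing of co-clustering with Map-Reduce concrete, the following is a minimal in-memory sketch, not the DisCo implementation. It assumes a sparse matrix given as (row, column, value) cells and a current assignment of rows and columns to groups, and it computes the per-block sums that a co-clustering iteration typically needs. All names (`map_cell`, `reduce_block`, `block_sums`) are hypothetical, and the shuffle phase is simulated with a dictionary rather than run on Hadoop.

```python
from collections import defaultdict


def map_cell(cell, row_group, col_group):
    """Map step: key one nonzero cell by its (row group, column group) block."""
    i, j, value = cell
    return (row_group[i], col_group[j]), value


def reduce_block(values):
    """Reduce step: sum all values that landed in the same block."""
    return sum(values)


def block_sums(cells, row_group, col_group):
    """Simulate map, shuffle, and reduce in memory (assumption: data fits in RAM)."""
    shuffled = defaultdict(list)
    for cell in cells:
        key, value = map_cell(cell, row_group, col_group)
        shuffled[key].append(value)  # shuffle: group emitted values by block key
    return {key: reduce_block(vals) for key, vals in shuffled.items()}


# Toy 4x4 binary matrix stored as (row, column, value) triples.
cells = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (2, 2, 1), (3, 2, 1), (3, 3, 1)]
# Hypothetical current assignment: two row groups and two column groups.
row_group = {0: 0, 1: 0, 2: 1, 3: 1}
col_group = {0: 0, 1: 0, 2: 1, 3: 1}

sums = block_sums(cells, row_group, col_group)
print(sums)
```

In a real deployment the map and reduce functions would run as distributed tasks and the group assignments would be shared with every mapper; here they are plain dictionaries to keep the sketch self-contained.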