A bloat-aware design for big data applications

Authors:
Yingyi Bu;Vinayak Borkar;Guoqing Xu;Michael J. Carey
Affiliations:
University of California, Irvine, Irvine, California, USA;University of California, Irvine, Irvine, California, USA;University of California, Irvine, Irvine, California, USA;University of California, Irvine, Irvine, California, USA
Venue:
Proceedings of the 2013 international symposium on memory management
Year:
2013

Abstract

Over the past decade, growing demand for data-driven business intelligence has led to a proliferation of large-scale, data-intensive applications that must process huge amounts of data, often at terabyte or petabyte scale. An object-oriented programming language such as Java is often the developer's choice for implementing such applications, primarily because of its quick development cycle and rich community resources. While such languages make programming easier, they can also cause significant performance problems: the inefficiencies inherent in a managed run-time system, combined with the huge amount of data to be processed in limited memory, often lead to memory bloat and performance degradation at a surprisingly early stage. This paper proposes a bloat-aware design paradigm for developing efficient and scalable Big Data applications in object-oriented, GC-enabled languages. To motivate this work, we first study the impact of several typical memory bloat patterns, which are summarized from user complaints on the mailing lists of two widely used open-source Big Data applications. Next, we discuss our design paradigm to eliminate bloat. Using examples and real-world experience, we demonstrate that programming under this paradigm does not incur a significant programming burden. We have implemented a few common data processing tasks using both this design and the conventional object-oriented design. Our experimental results show that the new design paradigm is extremely effective in improving performance: even for the moderate-size data sets processed, we observed performance gains of 2.5x or more, and the improvement grows substantially with the size of the data set.
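
The abstract does not spell out the mechanics of the paradigm, so the following is only an illustrative sketch of one general way to reduce per-object overhead in a managed language: storing many fixed-width records in a single buffer instead of allocating one heap object per record. The class name `PackedRecords`, the (id, value) record layout, and all method names are hypothetical and are not taken from the paper; this should not be read as the authors' implementation.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical illustration: fixed-width (int id, long value) records
 * packed back to back in one ByteBuffer, so the garbage collector sees
 * a single large object instead of one small object per record.
 */
final class PackedRecords {
    // Assumed layout: 4-byte id followed by 8-byte value.
    private static final int RECORD_SIZE = Integer.BYTES + Long.BYTES;

    private final ByteBuffer buffer;
    private int count;

    PackedRecords(int capacity) {
        this.buffer = ByteBuffer.allocate(capacity * RECORD_SIZE);
    }

    /** Writes one record at the next free slot using absolute puts. */
    void append(int id, long value) {
        int offset = count * RECORD_SIZE;
        if (offset + RECORD_SIZE > buffer.capacity()) {
            throw new IllegalStateException("buffer full");
        }
        buffer.putInt(offset, id);
        buffer.putLong(offset + Integer.BYTES, value);
        count++;
    }

    int size() {
        return count;
    }

    /** Reads the id field of record {@code index} directly from the buffer. */
    int idAt(int index) {
        return buffer.getInt(checkIndex(index) * RECORD_SIZE);
    }

    /** Reads the value field of record {@code index} directly from the buffer. */
    long valueAt(int index) {
        return buffer.getLong(checkIndex(index) * RECORD_SIZE + Integer.BYTES);
    }

    /** Aggregates over all records without materializing any record object. */
    long sumValues() {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += valueAt(i);
        }
        return sum;
    }

    private int checkIndex(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + count);
        }
        return index;
    }

    public static void main(String[] args) {
        PackedRecords records = new PackedRecords(3);
        records.append(1, 10L);
        records.append(2, 20L);
        records.append(3, 30L);
        System.out.println(records.sumValues()); // prints 60
    }
}
```

In a conventional object-oriented design, each record would typically be its own object with its own header and reference, held in a collection. The sketch trades some convenience (field access goes through offsets) for a heap that contains far fewer objects, which is the kind of trade-off the abstract's comparison between the two designs is concerned with.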