Just-in-time analytics on large file systems

Authors:
H. Howie Huang;Nan Zhang;Wei Wang;Gautam Das;Alexander S. Szalay
Affiliations:
George Washington University;George Washington University;George Washington University;University of Texas at Arlington;Johns Hopkins University
Venue:
FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
Year:
2011

Abstract

As file systems reach the petabyte scale, users and administrators are increasingly interested in acquiring high-level analytical information for file management and analysis. Two particularly important tasks are the processing of aggregate and top-k queries, which, unfortunately, cannot be answered quickly by hierarchical file systems such as ext3 and NTFS. Existing preprocessing-based solutions, e.g., file system crawling and index building, consume a significant amount of time and space (for generating and maintaining the indexes), which in many cases cannot be justified by the infrequent usage of such solutions. In this paper, we advocate that user interests can often be sufficiently satisfied by approximate (i.e., statistically accurate) answers. We develop Glance, a just-in-time sampling-based system that, after a small number of disk accesses, can produce extremely accurate answers for a broad class of aggregate and top-k queries over a file system without requiring any prior knowledge. We use a number of real-world file systems to demonstrate the efficiency, accuracy, and scalability of Glance.
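
The abstract gives no algorithmic detail, so the following is only an illustrative sketch of how a sampling-based estimator can answer an aggregate query (here, the total number of files) without crawling the whole tree. It is not presented as Glance's actual algorithm. The in-memory nested-dict tree, the function names, and the uniform choice among subdirectories are all assumptions made for illustration. Each random descent visits one root-to-leaf path and weights the files it sees by the inverse of the probability of reaching that directory, which makes the per-descent estimate unbiased.

```python
import random

# Hypothetical in-memory stand-in for a file system:
# a dict is a directory, any other value is a file (here, a size in bytes).
SAMPLE_TREE = {
    "a.txt": 10,
    "docs": {"b.txt": 20, "c.txt": 30, "old": {"d.txt": 40}},
    "src": {"e.py": 50},
}


def descent_estimate(tree, rng):
    """Run one random descent and return an estimate of the total file count.

    Files seen in a directory are divided by the probability that a descent
    reaches that directory, so the expected value equals the true count.
    """
    estimate = 0.0
    prob = 1.0  # probability that a descent reaches the current directory
    node = tree
    while True:
        subdirs = [v for v in node.values() if isinstance(v, dict)]
        n_files = len(node) - len(subdirs)
        estimate += n_files / prob
        if not subdirs:
            return estimate
        # Assumption: pick one subdirectory uniformly at random.
        node = rng.choice(subdirs)
        prob /= len(subdirs)


def estimate_file_count(tree, descents=1000, seed=0):
    """Average several independent descents to reduce the variance."""
    rng = random.Random(seed)
    total = sum(descent_estimate(tree, rng) for _ in range(descents))
    return total / descents


def exact_file_count(tree):
    """Full crawl, used here only to check the estimate."""
    count = 0
    for value in tree.values():
        if isinstance(value, dict):
            count += exact_file_count(value)
        else:
            count += 1
    return count
```

On a real file system, each step of the descent would correspond to listing one directory, so the cost of a descent grows with the depth of the tree rather than with its total size. Calling `estimate_file_count(SAMPLE_TREE)` and comparing it with `exact_file_count(SAMPLE_TREE)` shows how close the average of many descents comes to a full crawl.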