Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
Random sampling from database files: a survey
SSDBM V Proceedings of the fifth international conference on Statistical and scientific database management
Query-based sampling of text databases
ACM Transactions on Information Systems (TOIS)
Simple Random Sampling from Relational Databases
VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases
Approximate Query Processing: Taming the TeraBytes
Proceedings of the 27th International Conference on Very Large Data Bases
Effective use of block-level sampling in statistics estimation
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
A two-phase sampling technique for information extraction from hidden web databases
Proceedings of the 6th annual ACM international workshop on Web information and data management
Passive NFS Tracing of Email and Research Workloads
FAST '03 Proceedings of the 2nd USENIX Conference on File and Storage Technologies
Sampling, information extraction and summarisation of hidden web databases
Data & Knowledge Engineering - Special issue: WIDM 2004
Optimized stratified sampling for approximate query processing
ACM Transactions on Database Systems (TODS)
Threats to privacy in the forensic analysis of database systems
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
A random walk approach to sampling hidden databases
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
A five-year study of file-system metadata
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Distributed search over the hidden web: hierarchical database sampling and selection
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
HBA: Distributed Metadata Management for Large Cluster-Based Storage Systems
IEEE Transactions on Parallel and Distributed Systems
A survey of top-k query processing techniques in relational database systems
ACM Computing Surveys (CSUR)
New Challenges in Petascale Scientific Databases
SSDBM '08 Proceedings of the 20th international conference on Scientific and Statistical Database Management
Sparse indexing: large scale, inline deduplication using sampling and locality
FAST '09 Proccedings of the 7th conference on File and storage technologies
Spyglass: fast, scalable metadata search for large-scale storage systems
FAST '09 Proccedings of the 7th conference on File and storage technologies
Leveraging COUNT Information in Sampling Hidden Databases
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
Privacy preservation of aggregates in hidden databases: why and how?
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Generating realistic impressions for file-system benchmarking
ACM Transactions on Storage (TOS)
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Fusing data management services with file systems
Proceedings of the 4th Annual Workshop on Petascale Data Storage
Unbiased estimation of size and other aggregates over hidden web databases
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Hierarchical file systems are dead
HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Hi-index | 0.00 |
As file systems reach the petabytes scale, users and administrators are increasingly interested in acquiring high-level analytical information for file management and analysis. Two particularly important tasks are the processing of aggregate and top-k queries which, unfortunately, cannot be quickly answered by hierarchical file systems such as ext3 and NTFS. Existing pre-processing based solutions, e.g., file system crawling and index building, consume a significant amount of time and space (for generating and maintaining the indexes) which in many cases cannot be justified by the infrequent usage of such solutions. In this paper, we advocate that user interests can often be sufficiently satisfied by approximate - i.e., statistically accurate - answers. We develop Glance, a just-in-time sampling-based system which, after consuming a small number of disk accesses, is capable of producing extremely accurate answers for a broad class of aggregate and top-k queries over a file system without the requirement of any prior knowledge. We use a number of real-world file systems to demonstrate the efficiency, accuracy and scalability of Glance.