XPathLearner: an on-line self-tuning Markov histogram for XML path selectivity estimation

Authors:
Lipyeow Lim; Min Wang; Sriram Padmanabhan; Jeffrey Scott Vitter; Ronald Parr
Affiliations:
Department of Computer Science, Duke University, Durham, NC; IBM T. J. Watson Research Center, Hawthorne, NY; IBM T. J. Watson Research Center, Hawthorne, NY; Department of Computer Science, Duke University, Durham, NC; Department of Computer Science, Duke University, Durham, NC
Venue:
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Year:
2002

Abstract

The Extensible Markup Language (XML) is gaining widespread use as a format for data exchange and storage on the World Wide Web. Queries over XML data require accurate selectivity estimation of path expressions to optimize query execution plans. Selectivity estimation of XML path expressions is usually based on summary statistics about the structure of the underlying XML repository. All previous methods require an off-line scan of the XML repository to collect the statistics. In this paper, we propose XPathLearner, a method for estimating the selectivity of the most commonly used types of path expressions without looking at the XML data. XPathLearner gathers and refines the statistics on-line using query feedback and is especially suited to queries in Internet-scale applications, since the underlying XML repository is either inaccessible or too large to be scanned in its entirety. Besides the on-line property, our method has two other novel features: (a) XPathLearner is workload-aware in collecting the statistics and can therefore be more accurate than the more costly off-line method under tight memory constraints, and (b) XPathLearner automatically adjusts the statistics using query feedback when the underlying XML data change. We demonstrate the estimation accuracy of our method empirically on several real data sets.
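
To make the idea of feedback-driven path statistics concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a first-order Markov model over simple child-axis paths such as /a/b/c, in which a longer path is estimated from stored counts for single tags and parent/child pairs. The class name MarkovPathEstimator, its methods, and the multiplicative rescaling used in feedback() are all hypothetical choices made for illustration; the abstract does not specify the update rule.

```python
class MarkovPathEstimator:
    """Illustrative sketch: first-order Markov estimate refined by query feedback."""

    def __init__(self):
        self.single = {}  # tag -> estimated count of elements with that tag
        self.pair = {}    # (parent, child) -> estimated count of that edge

    @staticmethod
    def _tags(path):
        # Assumption: only simple paths of the form /t1/t2/.../tn are handled.
        return path.strip("/").split("/")

    def estimate(self, path):
        tags = self._tags(path)
        if len(tags) == 1:
            return self.single.get(tags[0], 0.0)
        # f(t1/t2) * product over i of f(ti/ti+1) / f(ti)
        est = self.pair.get((tags[0], tags[1]), 0.0)
        for i in range(1, len(tags) - 1):
            denom = self.single.get(tags[i], 0.0)
            if denom == 0.0:
                return 0.0
            est *= self.pair.get((tags[i], tags[i + 1]), 0.0) / denom
        return est

    def feedback(self, path, actual):
        """Refine the stored counts from the true result size of an executed query."""
        tags = self._tags(path)
        if len(tags) == 1:
            self.single[tags[0]] = float(actual)
            return
        if len(tags) == 2:
            self.pair[(tags[0], tags[1])] = float(actual)
            return
        est = self.estimate(path)
        if est <= 0.0 or actual <= 0:
            return  # this sketch has nothing to rescale in these cases
        pairs = list(zip(tags, tags[1:]))
        # Hypothetical rule: spread the correction evenly over the pair factors,
        # so the new estimate for this path equals the observed value.
        factor = (actual / est) ** (1.0 / len(pairs))
        for p in set(pairs):
            self.pair[p] *= factor


est = MarkovPathEstimator()
est.feedback("/b", 20)
est.feedback("/a/b", 10)
est.feedback("/b/c", 40)
print(est.estimate("/a/b/c"))   # 10 * (40 / 20) = 20.0
est.feedback("/a/b/c", 5)       # observed result size from an executed query
print(est.estimate("/a/b/c"))   # 5.0
```

No scan of the data is involved: every stored count comes from the result size of a query that was actually executed, which is the on-line property described above. A memory budget and workload-aware choice of which counts to keep are omitted from this sketch.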