Efficient Multidimensional Quantitative Hypotheses Generation

Authors:
Amihood Amir; Reuven Kashi; Nathan S. Netanyahu
Venue:
ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Year:
2003

Abstract

Finding local interrelations (hypotheses) among attributes within very large databases of high dimensionality is an acute problem for many database and data mining applications. These include dependency modeling, clustering large databases, and correlation and link analysis. Traditional statistical methods are concerned with the corroboration of (a set of) hypotheses on a given body of data. Testing all of the hypotheses that can be generated from a database with millions of records and dozens of fields is clearly infeasible. Generating a set of the most "promising" hypotheses (to be corroborated), on the other hand, requires much intuition and ingenuity.

In this paper we present an efficient method for ranking multidimensional hypotheses using image processing of data visualizations. At the heart of the method lies the use of visualization techniques and image processing ideas to rank subsets of attributes according to the relation between them in the database. Some of the scalability issues are solved by concise generalized histograms and by an efficient on-line computation of clustering around a median that requires only five additional memory words. In addition to presenting our algorithmic methodology, we demonstrate its efficiency and performance by applying it to real census data sets as well as synthetic data sets.
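The abstract does not spell out how the on-line computation around a median works, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes a standard stochastic-approximation update (the estimate is nudged toward each new value by a shrinking step) and tracks a running mean absolute deviation as a rough measure of how tightly the data cluster around that estimate. The class name `OnlineMedian`, the `scale` parameter, and the `spread` statistic are all hypothetical choices made for illustration.

```python
import random


class OnlineMedian:
    """Constant-memory running median estimate (illustrative sketch only)."""

    def __init__(self, scale=10.0):
        self.scale = scale      # assumed step scale, roughly the spread of the data
        self.estimate = None    # current median estimate
        self.count = 0          # number of values seen so far
        self.spread = 0.0       # running mean of |x - estimate|

    def update(self, x):
        self.count += 1
        if self.estimate is None:
            # The first value initializes the estimate.
            self.estimate = float(x)
            return
        # Fold the new absolute deviation into the running mean.
        self.spread += (abs(x - self.estimate) - self.spread) / self.count
        # Move toward x by a step that shrinks as 1/count.
        step = self.scale / self.count
        if x > self.estimate:
            self.estimate += step
        elif x < self.estimate:
            self.estimate -= step


if __name__ == "__main__":
    rng = random.Random(0)
    tracker = OnlineMedian(scale=10.0)
    for _ in range(20000):
        tracker.update(rng.gauss(50.0, 10.0))
    print(f"median estimate: {tracker.estimate:.2f}")
    print(f"mean absolute deviation: {tracker.spread:.2f}")
```

The state is a few scalars regardless of how many records are streamed, which is the property the abstract emphasizes. The step scale has to be chosen to match the data's spread, which is a limitation of this particular sketch.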