With advances in data collection methods, large volumes of data have been gathered in application fields such as medical science, meteorology, and electronic commerce. Analyzing these data requires integrating multiple heterogeneous data sets. For historical reasons, both technical and non-technical, the schemas of the data sets to be integrated are usually complex and differ from one another. Analyzing the integrated data may therefore yield ambiguous results because of their non-uniform schemas. This paper targets the mining of this kind of data, and its main contributions are as follows: (1) it proposes schema uncertainty to describe data with non-uniform schemas, and the couple correlation degree (Cor) to evaluate the existence probabilities of records in data with schema uncertainty, based on the subject of the analysis; (2) it designs a data structure, the "B-correlation tree", that organizes uncertain data and their existence probabilities hierarchically, and discusses how selecting nodes on different levels of the B-correlation tree affects the distribution; (3) it proposes an efficient Monte Carlo algorithm for analyzing uncertain data, MonteCarlo-evaluate (MCE), which is based on the B-correlation tree and handles data with schema uncertainty; (4) it analyzes the accuracy and convergence of MCE theoretically; (5) it implements a prototype system using the B-correlation tree and MCE on real medical data and synthetic TPC-H benchmark [20] data, and reports extensive experiments that test the effectiveness and efficiency of the proposed methods. The experimental results show that the proposed methods evaluate schema uncertainty in data efficiently and are therefore suited to analyzing large-scale data with schema uncertainty.
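To make the idea of Monte Carlo analysis over records with existence probabilities concrete, the following is a minimal sketch of generic possible-world sampling. It is not the paper's MCE algorithm and does not use a B-correlation tree, neither of which is specified in the abstract. The function names, the `(value, probability)` record layout, and the assumption that records exist independently of one another are all illustrative assumptions.

```python
import random
from typing import List, Tuple

# Hypothetical record layout: (value, existence_probability).
Record = Tuple[float, float]


def monte_carlo_sum(records: List[Record], n_samples: int = 10_000, seed: int = 0) -> float:
    """Estimate the expected SUM over uncertain records by sampling possible worlds.

    Assumption (not from the paper): each record exists independently
    with its own existence probability.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw one possible world: keep each record with probability p.
        world_sum = sum(value for value, p in records if rng.random() < p)
        total += world_sum
    # Average the aggregate over all sampled worlds.
    return total / n_samples


def exact_expected_sum(records: List[Record]) -> float:
    """Closed-form expectation of SUM under independence, used as a sanity check."""
    return sum(value * p for value, p in records)


records = [(10.0, 0.9), (20.0, 0.5), (30.0, 0.1)]
estimate = monte_carlo_sum(records)
exact = exact_expected_sum(records)
print(f"Monte Carlo estimate: {estimate:.3f}, exact expectation: {exact:.3f}")
```

For a simple SUM the expectation has a closed form, so sampling is unnecessary here; the sketch only illustrates the mechanics. Sampling becomes useful for aggregates whose distribution over possible worlds has no convenient closed form, and the standard error of the estimate shrinks in proportion to one over the square root of the number of samples.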