Efficient subject-oriented evaluating and mining methods for data with schema uncertainty

Authors:
Yue Wang;Changjie Tang;Tengjiao Wang;Dongqing Yang;Jun Zhu
Affiliations:
Key Laboratory of High Confidence Software Technologies, Ministry of Education, Peking University, China;School of Computer Science, Sichuan University, Chengdu, China;Key Laboratory of High Confidence Software Technologies, Ministry of Education, Peking University, China;Key Laboratory of High Confidence Software Technologies, Ministry of Education, Peking University, China;China Birth Defect Monitoring Centre, Sichuan University, Chengdu, China
Venue:
ADMA'11 Proceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part I
Year:
2011



Abstract

As data collection methods have advanced, large volumes of data have been gathered in application fields such as medical science, meteorology, and electronic commerce. Analyzing these data requires integrating heterogeneous data sets. For historical reasons, both technical and non-technical, the schemas of the data sets to be integrated are usually complex and differ from one another, so analyzing the integrated data may yield ambiguous results. This paper targets the mining of this kind of data. Its main contributions are: (1) it proposes schema uncertainty to describe data with non-uniform schemas, and the couple correlation degree (Cor) to evaluate the existence probabilities of records in such data with respect to the subject of analysis; (2) it designs a data structure, the B-correlation tree, which organizes uncertain data and their existence probabilities hierarchically, and discusses how selecting nodes at different levels of the tree affects the distribution; (3) it proposes an efficient Monte Carlo algorithm for analyzing uncertain data, MonteCarlo-evaluate (MCE), based on the B-correlation tree, for data with schema uncertainty; (4) it analyzes the accuracy and convergence of MCE theoretically; (5) it implements a prototype system using the B-correlation tree and MCE on real medical data and synthetic TPC-H benchmark [20] data, and reports experiments that test the effectiveness and efficiency of the proposed methods. The experimental results show that the proposed methods evaluate schema uncertainty in data efficiently and are therefore suited to analyzing large-scale data with schema uncertainty.
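The abstract does not give the details of MCE, the B-correlation tree, or Cor, so the following is only a generic sketch of the underlying idea: estimating an aggregate over records that each carry an existence probability by sampling possible worlds. It assumes records exist independently of one another, which the paper may not assume, and every name in it (`monte_carlo_expected_sum`, `exact_expected_sum`, the `(value, probability)` record layout) is hypothetical.

```python
import random


def monte_carlo_expected_sum(records, n_samples=10_000, seed=0):
    """Estimate the expected SUM of values over uncertain records.

    `records` is a list of (value, existence_probability) pairs.
    Assumption (not from the paper): records exist independently.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw one possible world: keep each record with its probability.
        total += sum(value for value, p in records if rng.random() < p)
    # Average the per-world sums to get the estimate.
    return total / n_samples


def exact_expected_sum(records):
    """Closed-form expectation, available here because SUM is linear."""
    return sum(value * p for value, p in records)


# Hypothetical records: (value, existence probability).
records = [(10.0, 0.9), (20.0, 0.5), (30.0, 0.1)]
estimate = monte_carlo_expected_sum(records)
exact = exact_expected_sum(records)
```

For a linear aggregate like this one the exact expectation is cheap to compute, so the sampler is useful mainly as a check. Sampling becomes worthwhile for aggregates without a simple closed form, and the estimate tightens as `n_samples` grows.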