Scalable mining and link analysis across multiple database relations

Authors:
Jiawei Han;Xiaoxin Yin
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Scalable mining and link analysis across multiple database relations
Year:
2007

Cited by:
Data mining from multiple heterogeneous relational databases using decision tree classification (Pattern Recognition Letters)

Abstract

Relational databases are the most popular repository for structured data and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Unfortunately, most existing data mining approaches can only handle data stored in a single table and cannot be applied to relational databases. It is therefore an urgent task to design data mining approaches that can discover knowledge from multi-relational data. In this thesis we study three of the most important data mining tasks in multi-relational environments: classification, clustering, and duplicate detection. Because information is spread widely across multiple relations, the most crucial and common challenge in multi-relational data mining is how to utilize the relational information linked with each object. We rely on two types of information, neighbor tuples and linkages between objects, to analyze the properties of objects and the relationships among them. Because of the complexity of multi-relational data, efficiency and scalability are two major concerns. We propose scalable and accurate approaches for each data mining task studied. To achieve high efficiency and scalability, these approaches use novel techniques for virtually joining different relations, single-scan algorithms, and multi-resolution data structures to dramatically reduce computational costs. Our experiments show that the approaches are highly efficient and scalable and also achieve high accuracy in multi-relational data mining.