Modern information networks, such as social networks, are often characterized by large size and dynamically changing structure. To analyze these networks, existing solutions commonly rely on graph sampling techniques to reduce the network size, and then carry out subsequent mining processes, such as labeling network nodes to build classification models. Such a sampling-then-labeling paradigm assumes that the whole network is available for sampling and that the sampled network is useful for all subsequent tasks (such as network classification). Yet real-world networks are rarely available in full unless the sampling process progressively crawls every single node and its connections. Moreover, a network sampled without knowledge of the underlying analytic objective can hardly produce quality results. In this paper, we propose an Active Exploration framework for large graphs, where the goal is to carry out network sampling and node labeling at the same time. To achieve this goal, we treat a network as a Markov chain and compute its stationary distribution by using supervised random walks. The stationary distribution of the sampled network helps identify important nodes to be explored in the next step, and the labeling process labels the most informative node, which in turn strengthens the sampling process. Because sampling and labeling reinforce each other and run simultaneously, the final network is intended to contain as many nodes as possible that are directly related to the underlying mining task.
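The idea of ranking candidate nodes by a stationary distribution can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes fixed, hand-picked edge weights (in supervised random walks these strengths would be learned from data), a uniform restart probability, and a hypothetical selection rule that picks the unexplored node with the highest stationary probability. The names `stationary_distribution`, `next_node_to_explore`, and the toy `graph` are all illustrative.

```python
def stationary_distribution(weights, restart=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for a random walk with uniform restart.

    weights: dict mapping node -> {neighbor: non-negative edge weight}.
    Every neighbor must also appear as a key of `weights`.
    """
    nodes = sorted(weights)
    n = len(nodes)
    pi = {v: 1.0 / n for v in nodes}  # start from the uniform distribution
    for _ in range(max_iter):
        # Restart mass is spread uniformly over all nodes.
        nxt = {v: restart / n for v in nodes}
        for u in nodes:
            total = sum(weights[u].values())
            if total == 0:
                # Dangling node: spread its mass uniformly.
                for v in nodes:
                    nxt[v] += (1 - restart) * pi[u] / n
                continue
            # Move mass along edges in proportion to their weights.
            for v, w in weights[u].items():
                nxt[v] += (1 - restart) * pi[u] * w / total
        delta = sum(abs(nxt[v] - pi[v]) for v in nodes)
        pi = nxt
        if delta < tol:
            break
    return pi


def next_node_to_explore(pi, explored):
    """Hypothetical rule: pick the unexplored node with the largest score."""
    candidates = [v for v in pi if v not in explored]
    return max(candidates, key=pi.get)


# Toy graph with assumed (not learned) edge strengths.
graph = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"a": 1.0, "b": 2.0, "d": 1.0},
    "d": {"c": 1.0},
}
pi = stationary_distribution(graph)
chosen = next_node_to_explore(pi, explored={"a", "b"})
```

In this toy graph, node `c` has more and heavier connections than node `d`, so it receives more stationary mass and is selected first. A framework like the one described would additionally update the edge strengths as new labels arrive, which this sketch leaves out.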