Extract interesting skyline points in high dimension

  • Authors:
  • Gabriel Pui Cheong Fung;Wei Lu;Jing Yang;Xiaoyong Du;Xiaofang Zhou

  • Affiliations:
  • School of ITEE, The University of Queensland, Australia;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China;School of Information, Renmin University of China, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China;School of ITEE, The University of Queensland, Australia

  • Venue:
  • DASFAA'10 Proceedings of the 15th international conference on Database Systems for Advanced Applications - Volume Part II
  • Year:
  • 2010

Abstract

When the dimensionality of a dataset increases even slightly, the number of skyline points increases dramatically, as it is unlikely for a point to perform equally well in all dimensions. When the dimensionality is very high, almost all points are skyline points. Automatically extracting interesting skyline points in high-dimensional space is therefore necessary. In our experience, when deciding whether a point is interesting, we seldom base the decision only on a pairwise comparison of two points (as in skyline identification); we also study how well a point performs in each dimension. For example, in a scholarship assignment problem, the students selected for scholarships should never be those who merely outperform some other students in those students' weakest subjects (as in the skyline). Instead, we should select students whose performance in some subjects is better than that of a reasonable number of students. In the extreme case, even if a student is outstanding in just one subject, we may still award her a scholarship if she can demonstrate that she is extraordinary in that area. In this paper, we formalize this idea and propose a novel concept called the k-dominate p-core skyline ($C^k_p$). $C^k_p$ is a subset of the skyline. To identify $C^k_p$ efficiently, we propose an effective tree structure called the Linked Multiple B’-tree (LMB). With LMB, we can identify $C^k_p$ within a few seconds from a dataset containing 100,000 points and 15 dimensions.
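
The growth of the skyline with dimensionality can be illustrated with a minimal sketch of the conventional skyline (pairwise Pareto dominance). It does not implement $C^k_p$ or the LMB structure, which the abstract names but does not define. The function names, the "larger is better" convention, and the uniform random data are assumptions made only for this illustration.

```python
import random


def dominates(a, b):
    """Return True if a is at least as good as b in every dimension
    and strictly better in at least one (assumption: larger is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def skyline(points):
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def skyline_fraction(n, d, seed=0):
    """Fraction of n uniformly random d-dimensional points on the skyline.
    The uniform data is a hypothetical stand-in, not the paper's dataset."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
    return len(skyline(points)) / n


if __name__ == "__main__":
    # Hypothetical scores of four students in two subjects.
    students = [(90, 60), (80, 80), (70, 70), (60, 95)]
    print(skyline(students))

    # The skyline fraction tends to grow as the dimensionality increases.
    for d in (2, 5, 10, 15):
        print(d, skyline_fraction(300, d))
```

In the two-subject example only (70, 70) is dominated, by (80, 80), while a student such as (60, 95) stays on the skyline because of a single strong subject. This shows why pairwise dominance alone becomes a weak filter as more subjects are added.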