A study of applying dimensionality reduction to restrict the size of a hypothesis space

  • Authors: Ashwin Srinivasan; Ravi Kothari
  • Affiliations: IBM India Research Laboratory, Block 1, Indian Institute of Technology, New Delhi, India (both authors)
  • Venue: ILP'05 Proceedings of the 15th International Conference on Inductive Logic Programming
  • Year: 2005

Abstract

Given sample data and background knowledge encoded in the form of logic programs, a predictive Inductive Logic Programming (ILP) system attempts to find a set of rules (or clauses) for predicting classification labels in the data. Most present-day systems for this purpose rely on some variant of a generate-and-test procedure that repeatedly examines a set of potential candidates (termed here the “hypothesis space”). On each iteration, a search procedure is employed to find the “best” clause. The worst-case time complexity of such systems depends critically on: (1) the size of the hypothesis spaces examined; and (2) the cost of estimating the goodness of a clause. To date, attempts to improve the efficiency of such ILP systems have concentrated either on examining fewer clauses within a given hypothesis space, or on efficient means of estimating the goodness of clauses. The principal means of restricting the size of the hypothesis space itself has been the use of language and search constraints. Given such constraints, this paper investigates the use of a dimensionality reduction method to further reduce the size of the hypothesis space. Specifically, for a particular kind of ILP system, clauses in the search space are represented as points in a high-dimensional space. Using a sample of points from this geometric space, feature selection is used to discard dimensions of little or no (statistical) relevance. The resulting lower-dimensional space translates directly, in the worst case, to a smaller hypothesis space. We evaluate this approach on one controlled domain (graphs) and two real-life datasets concerning problems from biochemistry (mutagenesis and carcinogenesis). In each case, we obtain unbiased estimates of the size of the hypothesis space before and after feature selection, and compare the results obtained using a complete search of the two spaces.
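
The general idea can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not name the clause encoding, the feature selection procedure, or the goodness measure, so everything below is an assumption made for illustration. Each clause is assumed to be a 0/1 vector over a fixed set of candidate literals, the goodness function `toy_score` is hypothetical, and relevance is taken to be a simple difference in mean score between sampled clauses with and without a literal. Under these assumptions, a space over n literals has at most 2^n clauses, and keeping k of the n dimensions leaves at most 2^k.

```python
import random
from statistics import mean


def sample_clauses(n_literals, n_samples, rng):
    """Draw random clauses, each encoded as a 0/1 vector over candidate literals."""
    return [[rng.randint(0, 1) for _ in range(n_literals)]
            for _ in range(n_samples)]


def relevance(points, scores, dim):
    """Absolute difference in mean score between clauses with and without literal `dim`."""
    with_lit = [s for p, s in zip(points, scores) if p[dim] == 1]
    without_lit = [s for p, s in zip(points, scores) if p[dim] == 0]
    if not with_lit or not without_lit:
        return 0.0  # the sample gives no evidence either way
    return abs(mean(with_lit) - mean(without_lit))


def select_dimensions(points, scores, threshold):
    """Keep only the dimensions whose relevance reaches the threshold."""
    n_dims = len(points[0])
    return [d for d in range(n_dims) if relevance(points, scores, d) >= threshold]


def toy_score(clause):
    """Hypothetical goodness measure: only literals 0 and 3 affect the score."""
    return 2.0 * clause[0] + 1.0 * clause[3]


N_LITERALS = 10                      # assumed number of candidate literals
rng = random.Random(0)               # fixed seed so the sample is reproducible

points = sample_clauses(N_LITERALS, 2000, rng)
scores = [toy_score(p) for p in points]
kept = select_dimensions(points, scores, threshold=0.5)

size_before = 2 ** N_LITERALS        # worst-case size of the full space
size_after = 2 ** len(kept)          # worst-case size after discarding dimensions

print("kept dimensions:", kept)
print("worst-case size before:", size_before, "after:", size_after)
```

The threshold of 0.5 is an arbitrary choice for this toy example; any real application would need a principled relevance test, and the paper's own estimates of hypothesis space size are obtained by its own procedure rather than by the simple power-of-two bound used here.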