A Study of Two Sampling Methods for Analyzing Large Datasets with ILP

  • Author: Ashwin Srinivasan
  • Affiliation: Oxford University Computing Laboratory, Oxford, UK. ashwin@comlab.ox.ac.uk
  • Venue: Data Mining and Knowledge Discovery
  • Year: 1999


Abstract

This paper is concerned with problems that arise when submitting large quantities of data to analysis by an Inductive Logic Programming (ILP) system. Complexity arguments usually make it prohibitive to analyse such datasets in their entirety. We examine two schemes that allow an ILP system to construct theories by sampling from this large pool of data. The first, “subsampling”, is a single-sample design in which the utility of a potential rule is evaluated on a randomly selected sub-sample of the data. The second, “logical windowing”, is a multiple-sample design that repeatedly tests a partially correct theory and sequentially includes the errors it makes. Both schemes are derived from techniques developed to enable propositional learning methods (like decision trees) to cope with large datasets. The ILP system CProgol, equipped with each of these methods, is used to construct theories for two datasets—one artificial (a chess endgame) and the other naturally occurring (a language tagging problem). In each case, we ask the following questions of CProgol equipped with sampling: (1) Is its theory comparable in predictive accuracy to that obtained if all the data were used (that is, no sampling was employed)?; and (2) Is its theory constructed in less time than the one obtained with all the data? For the problems considered, the answers to these questions are “yes”. This suggests that an ILP program equipped with an appropriate sampling method could satisfactorily address problems that have hitherto been inaccessible simply due to data extent.
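The two sampling designs described in the abstract can be illustrated in miniature. The sketch below is not CProgol's implementation; all names (`subsample_utility`, `logical_windowing`, the toy lookup-table learner) are hypothetical, and it only shows the control flow of each scheme: subsampling scores a candidate rule on a random subset of the examples, while windowing learns on a small window, finds the examples the current theory misclassifies, adds them to the window, and repeats until no errors remain.

```python
import random


def subsample_utility(rule, examples, sample_size, rng):
    """Subsampling: estimate a rule's utility on a random sub-sample
    of the data instead of the entire example pool."""
    sample = rng.sample(examples, min(sample_size, len(examples)))
    correct = sum(1 for x, y in sample if rule(x) == y)
    return correct / len(sample)


def logical_windowing(learn, examples, initial_size, rng):
    """Windowing: learn a theory from a small window of examples, then
    sequentially add the errors the theory makes until none remain."""
    window = rng.sample(examples, min(initial_size, len(examples)))
    while True:
        theory = learn(window)
        errors = [(x, y) for x, y in examples
                  if theory(x) != y and (x, y) not in window]
        if not errors:
            return theory
        window.extend(errors)


if __name__ == "__main__":
    # Toy task: classify integers by parity.
    examples = [(i, i % 2) for i in range(20)]
    rng = random.Random(0)

    # A perfect rule scores 1.0 on any sub-sample.
    print(subsample_utility(lambda x: x % 2, examples, 5, rng))

    # Toy learner: a lookup table over the window, defaulting to label 0.
    learn = lambda window: (lambda x, d=dict(window): d.get(x, 0))
    theory = logical_windowing(learn, examples, 4, rng)
    print(all(theory(x) == y for x, y in examples))
```

With this toy learner the window grows only by the misclassified odd-labelled examples, so the loop terminates after a couple of passes; a real ILP learner would replace the lookup table with clause construction over the window.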