Using ILP to construct features for information extraction from semi-structured text

  • Authors:
  • Ganesh Ramakrishnan;Sachindra Joshi;Sreeram Balakrishnan;Ashwin Srinivasan

  • Affiliations:
  • IBM India Research Laboratory, Indian Institute of Technology, New Delhi, India;IBM India Research Laboratory, Indian Institute of Technology, New Delhi, India;IBM India Research Laboratory, Indian Institute of Technology, New Delhi, India;IBM India Research Laboratory, Indian Institute of Technology, New Delhi, India and Dept. of CSE & Centre for Health Informatics, University of New Kensington, Sydney, Australia

  • Venue:
  • ILP'07 Proceedings of the 17th international conference on Inductive logic programming
  • Year:
  • 2007

Abstract

Machine-generated documents containing semi-structured text are rapidly forming the bulk of the data stored in an organisation. Given a feature-based representation of such data, methods like SVMs are able to construct good models for information extraction (IE). But how are the feature definitions to be obtained in the first place? (We are referring here to the representation problem: selecting good features from the ones defined comes later.) So far, features have been defined manually or by using special-purpose programs; neither approach scales well to handle the heterogeneity of the data or new domain-specific information. We suggest that Inductive Logic Programming (ILP) could assist in this. Specifically, we demonstrate the use of ILP to define features for seven IE tasks using two disparate sources of information. Our findings are as follows: (1) the ILP system is able to efficiently identify large numbers of good features; typically, the time taken to identify the features is comparable to the time taken to construct the predictive model; and (2) SVM models constructed with these ILP features are better than the best reported to date that rely heavily on hand-crafted features. For the ILP practitioner, we also present evidence supporting the claim that, for IE tasks, using an ILP system to assist in constructing an extensional representation of text data (in the form of features and their values) is better than using it to construct intensional models for the tasks (in the form of rules for information extraction).
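
To make the idea of an extensional representation concrete, the following is a minimal sketch and not the authors' system. It assumes that each constructed feature can be treated as a boolean test on a token in its context, and it uses three hand-written tests as stand-ins for clauses an ILP system might construct. The token attributes (`text`, `zone`), the feature names, and the sample tokens are all hypothetical.

```python
from typing import Callable, Dict, List

# A token with hypothetical attributes; these names are illustrative only.
Token = Dict[str, str]

# A feature is a boolean test on (tokens, position). Each one stands in for
# the body of a clause that an ILP system might construct.
Feature = Callable[[List[Token], int], bool]


def is_capitalised(tokens: List[Token], i: int) -> bool:
    # True when the token starts with an upper-case letter.
    return tokens[i]["text"][:1].isupper()


def preceded_by_label(tokens: List[Token], i: int) -> bool:
    # True when the previous token looks like a field label such as "Speaker:".
    return i > 0 and tokens[i - 1]["text"].endswith(":")


def in_header_zone(tokens: List[Token], i: int) -> bool:
    # True when the token sits in the (assumed) header zone of the document.
    return tokens[i]["zone"] == "header"


# Hand-written stand-ins; in the paper's setting these would be ILP-constructed.
FEATURES: Dict[str, Feature] = {
    "is_capitalised": is_capitalised,
    "preceded_by_label": preceded_by_label,
    "in_header_zone": in_header_zone,
}


def featurise(tokens: List[Token]) -> List[List[int]]:
    # One row per token, one 0/1 column per feature: the extensional table
    # that a learner such as an SVM would take as input.
    return [
        [int(test(tokens, i)) for test in FEATURES.values()]
        for i in range(len(tokens))
    ]


tokens = [
    {"text": "Speaker:", "zone": "header"},
    {"text": "Smith", "zone": "header"},
    {"text": "will", "zone": "body"},
]
matrix = featurise(tokens)
```

Each row of `matrix` is the feature vector for one token. Under the approach the abstract describes, a table of this kind would be passed to an SVM, whereas the intensional alternative would use the rules themselves as the extraction model.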