Customized information extraction as a basis for resource discovery

  • Authors: Darren R. Hardy; Michael F. Schwartz
  • Affiliations: Univ. of Colorado, Boulder; Univ. of Colorado, Boulder
  • Venue: ACM Transactions on Computer Systems (TOCS)
  • Year: 1996

Abstract

Indexing file contents is a powerful means of helping users locate documents, software, and other types of data among large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. We present a model for type-specific, user-customizable information extraction, and a system implementation called Essence. This software structure allows users to associate specialized extraction methods with ordinary files, providing the illusion of an object-oriented file system that encapsulates indexing methods within files. By exploiting the semantics of common file types, Essence generates compact yet representative file summaries that can be used to improve both browsing and indexing in resource discovery systems. Essence can extract information from most of the types of files found in common file systems, including files with nested structure (such as compressed “tar” files). Essence interoperates with a number of different search/index systems (such as WAIS and Glimpse), as part of the Harvest system.
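
The extraction model described in the abstract can be illustrated with a minimal sketch. This is not Essence's actual code or interface: the registry `SUMMARIZERS`, the decorator `summarizer`, the functions `identify_type` and `summarize`, and the three example file types are all hypothetical names chosen for illustration. The sketch only shows the general idea of identifying a file's type, dispatching to a type-specific extraction method that a user could register, and recursing into nested files such as tar archives.

```python
import io
import tarfile

# Hypothetical registry mapping a file type name to its extraction method.
SUMMARIZERS = {}


def summarizer(file_type):
    """Register a function as the extraction method for one file type."""
    def register(func):
        SUMMARIZERS[file_type] = func
        return func
    return register


def identify_type(name, data):
    """Guess a file type from its name and contents (illustrative heuristics only)."""
    # POSIX and GNU tar headers both carry the magic "ustar" at byte offset 257.
    if name.endswith(".tar") or data[257:262] == b"ustar":
        return "tar"
    if name.endswith(".py"):
        return "python"
    return "text"


@summarizer("text")
def summarize_text(name, data):
    # Keep only the first non-empty line as a compact summary.
    for line in data.decode("utf-8", errors="replace").splitlines():
        if line.strip():
            return [(name, line.strip())]
    return [(name, "")]


@summarizer("python")
def summarize_python(name, data):
    # Keep the names of top-level function definitions.
    text = data.decode("utf-8", errors="replace")
    names = [
        line[4:].split("(")[0].strip()
        for line in text.splitlines()
        if line.startswith("def ")
    ]
    return [(name, " ".join(names))]


@summarizer("tar")
def summarize_tar(name, data):
    # Nested structure: unpack the archive and summarize each member in turn.
    results = []
    with tarfile.open(fileobj=io.BytesIO(data)) as archive:
        for member in archive.getmembers():
            if member.isfile():
                content = archive.extractfile(member).read()
                results.extend(summarize(f"{name}/{member.name}", content))
    return results


def summarize(name, data):
    """Return a list of (path, summary) pairs for one file."""
    file_type = identify_type(name, data)
    return SUMMARIZERS[file_type](name, data)
```

Under these assumptions, a user customizes extraction by registering another function with `@summarizer("...")` and extending `identify_type`; the files themselves stay ordinary byte streams, and the resulting (path, summary) pairs are what an indexer would consume.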