Customized information extraction as a basis for resource discovery

Authors:
Darren R. Hardy;Michael F. Schwartz
Affiliations:
Univ. of Colorado, Boulder;Univ. of Colorado, Boulder
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
1996

Abstract

Indexing file contents is a powerful means of helping users locate documents, software, and other types of data among large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. We present a model for type-specific, user-customizable information extraction, and a system implementation called Essence. This software structure allows users to associate specialized extraction methods with ordinary files, providing the illusion of an object-oriented file system that encapsulates indexing methods within files. By exploiting the semantics of common file types, Essence generates compact yet representative file summaries that can be used to improve both browsing and indexing in resource discovery systems. Essence can extract information from most of the types of files found in common file systems, including files with nested structure (such as compressed “tar” files). Essence interoperates with a number of different search/index systems (such as WAIS and Glimpse), as part of the Harvest system.
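The idea of associating type-specific extraction methods with ordinary files, including files nested inside archives, can be illustrated with a small sketch. This is not Essence's code or interface: the registry `SUMMARIZERS`, the `register` decorator, the `identify` heuristics, and the individual summarizers are all hypothetical names chosen for illustration, and the type detection here relies only on file-name suffixes. The sketch assumes that a summary is simply a list of short strings.

```python
import io
import tarfile
from typing import Callable, Dict, List

# Hypothetical registry mapping a file type name to its summarizer.
Summarizer = Callable[[str, bytes], List[str]]
SUMMARIZERS: Dict[str, Summarizer] = {}


def register(file_type: str) -> Callable[[Summarizer], Summarizer]:
    """Let a user attach a custom extraction method to a file type."""
    def wrap(fn: Summarizer) -> Summarizer:
        SUMMARIZERS[file_type] = fn
        return fn
    return wrap


def identify(name: str) -> str:
    """Guess a file type from the name (illustrative heuristics only)."""
    if name.endswith((".tar", ".tar.gz", ".tgz")):
        return "tar"
    if name.endswith(".py"):
        return "python"
    return "text"


def summarize(name: str, data: bytes) -> List[str]:
    """Dispatch to the summarizer registered for the file's type."""
    return SUMMARIZERS[identify(name)](name, data)


@register("text")
def summarize_text(name: str, data: bytes) -> List[str]:
    # Keep the first non-empty line as a crude summary.
    for line in data.decode("utf-8", errors="replace").splitlines():
        if line.strip():
            return [f"{name}: {line.strip()}"]
    return []


@register("python")
def summarize_python(name: str, data: bytes) -> List[str]:
    # Keep only definition lines, which are compact yet representative.
    text = data.decode("utf-8", errors="replace")
    return [
        f"{name}: {line.strip()}"
        for line in text.splitlines()
        if line.lstrip().startswith(("def ", "class "))
    ]


@register("tar")
def summarize_tar(name: str, data: bytes) -> List[str]:
    # Unnest the archive and summarize each member by its own type.
    out: List[str] = []
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as archive:
        for member in archive.getmembers():
            if member.isfile():
                body = archive.extractfile(member).read()
                out.extend(summarize(f"{name}/{member.name}", body))
    return out


def demo_archive() -> bytes:
    """Build a small compressed tar file in memory for the usage example."""
    buf = io.BytesIO()
    files = [
        ("README", b"\nEssence demo\nmore text\n"),
        ("tool.py", b"import os\ndef run():\n    pass\n"),
    ]
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        for member_name, body in files:
            info = tarfile.TarInfo(member_name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))
    return buf.getvalue()


if __name__ == "__main__":
    for entry in summarize("pkg.tar.gz", demo_archive()):
        print(entry)
```

In this sketch, supporting a new file type means registering one more function, and the archive summarizer recurses through `summarize` so that nested members are handled by whatever method matches their own type. A real system would need more robust type recognition than file-name suffixes.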