Lightweight structured text processing

  • Authors:
  • Robert C. Miller;Brad A. Myers

  • Affiliations:
  • School of Computer Science, Carnegie Mellon University, Pittsburgh, PA;School of Computer Science, Carnegie Mellon University, Pittsburgh, PA

  • Venue:
  • ATEC '99 Proceedings of the annual conference on USENIX Annual Technical Conference
  • Year:
  • 1999

Quantified Score

Hi-index 0.00

Visualization

Abstract

Text is a popular storage and distribution format for information, partly due to generic text-processing tools like Unix grep and sort. Unfortunately, existing generic tools make assumptions about text format (e.g., each line is a record) that limit their applicability. Custom-built tools are one alternative, but they require substantial time investment and programming expertise. We describe a new approach, lightweight structured text processing, which overcomes these difficulties by enabling users to define text structure interactively and manipulate the structure with generic tools. Our prototype system, LAPIS, is a web browser that can highlight, filter, and sort text regions described by the user. LAPIS has several advantages over other systems: (1) the ability to define custom structure with a simple, intuitive pattern language; (2) interactive specification, showing pattern matches in context and letting users choose the most convenient combination of manual selection and pattern matching; and (3) external parsers for standard text formats. The pattern language in LAPIS, text constraints, describes text structure in high-level terms, with region relationships like before, after, in, and contains. We describe an implementation of text constraints using a novel, compact representation of region sets as collections of rectangles, or region intervals. We also illustrate some examples of applying LAPIS to web pages, text files, and source code.