Unsupervised discovery and extraction of semi-structured regions in text via self-information

  • Authors:
  • Eric Yeh; John Niekrasz; Dayne Freitag

  • Affiliations:
  • SRI International, Menlo Park, CA, USA; SRI International, San Diego, CA, USA; SRI International, San Diego, CA, USA

  • Venue:
  • Proceedings of the 2013 workshop on Automated knowledge base construction
  • Year:
  • 2013

Abstract

We describe a general method for identifying and extracting information from semi-structured regions of text embedded within a natural language document. These regions encode information according to ad hoc schemas and visual cues, rather than the grammatical and presentational conventions of normal sentential language. Examples include tables, key-value listings, and repeated enumerations of properties. Because they are generally non-sentential, these regions can pose problems for standard information extraction algorithms. Unlike previous work in table extraction, which relies on a relatively noiseless two-dimensional layout, our aim is to accommodate a wide variety of structure types. Our approach to identifying semi-structured regions is unsupervised and based on scoring unusual regularity inside the document. Because content in semi-structured regions is governed by a schema, the features that occur there, which encompass both textual content and visual appearance, are unusual compared to those seen in sentential language. Regularity refers to the repetition of these unusual features, as semi-structured regions commonly encode more than a single row or group of information. To score this, we present a measure based on expected self-information, derived from statistics over patterns of textual categories and visual layout. We describe the results of an initial study assessing the ability of this measure to detect semi-structured text in a corpus culled from the web, and show that it outperforms baseline methods in terms of average precision. We also present initial work that uses these significant patterns to generate extraction rules, and conclude with a discussion of future directions.
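
The abstract does not give the exact form of the measure, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes a hypothetical set of token categories (`WORD`, `NUM`, `ALNUM`, `GAP`), a small hypothetical background sample of sentential lines, add-one smoothing, and a fixed window of lines. Each line is reduced to a pattern of categories, the self-information of a pattern is taken as -log2 of its smoothed background probability, and a window scores highly when it repeats patterns that are surprising under the background.

```python
import math
import re
from collections import Counter


def line_pattern(line):
    """Reduce a line to a coarse category pattern (categories are assumptions)."""
    cats = []
    for tok in re.findall(r"\s{2,}|\w+|[^\w\s]", line):
        if tok.isspace():
            cats.append("GAP")    # run of 2+ whitespace characters: crude layout cue
        elif tok.isdigit():
            cats.append("NUM")
        elif tok.isalpha():
            cats.append("WORD")
        elif tok.isalnum():
            cats.append("ALNUM")
        else:
            cats.append(tok)      # keep punctuation and symbols literally
    # Collapse runs of the same category so line length matters less.
    return tuple(c for i, c in enumerate(cats) if i == 0 or c != cats[i - 1])


def self_information(pattern, background, total, vocab_size):
    """-log2 of the add-one smoothed background probability of a pattern."""
    p = (background.get(pattern, 0) + 1) / (total + vocab_size + 1)
    return -math.log2(p)


def score_windows(lines, background_lines, width=3):
    """Score each window of `width` lines by repeated, surprising patterns."""
    background = Counter(line_pattern(l) for l in background_lines)
    total = sum(background.values())
    vocab_size = len(background)
    patterns = [line_pattern(l) for l in lines]
    scores = []
    for start in range(len(patterns) - width + 1):
        window = Counter(patterns[start:start + width])
        # Only patterns that repeat inside the window contribute (regularity),
        # weighted by how unusual they are under the background (surprise).
        score = sum(
            n * self_information(p, background, total, vocab_size)
            for p, n in window.items()
            if n > 1
        )
        scores.append((start, score / width))
    return scores


# Hypothetical toy data: a sentential background and a document with a
# key-value listing in the middle.
background_lines = [
    "The cat sat on the mat.",
    "We describe a general method.",
    "It rained all day.",
    "Results are discussed later.",
]
document_lines = [
    "We measured three parts in the lab.",
    "The results are listed below.",
    "Width: 30",
    "Height: 45",
    "Depth: 12",
    "All values are in centimetres.",
]

scores = score_windows(document_lines, background_lines, width=3)
best_start, best_score = max(scores, key=lambda item: item[1])
```

On this toy input the highest-scoring window is the one that covers the three key-value lines, because their shared pattern is both repeated and absent from the background. A real system would need richer features and a much larger background model than this sketch assumes.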