Analysis and taxonomy of column header categories for web tables

  • Authors:
  • Sharad Seth;Ramana Jandhyala;Mukkai Krishnamoorthy;George Nagy

  • Affiliations:
  • University of Nebraska -- Lincoln, Lincoln, NE;Rensselaer Polytechnic Institute, Troy, NY;Rensselaer Polytechnic Institute, Troy, NY;Rensselaer Polytechnic Institute, Troy, NY

  • Venue:
  • DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
  • Year:
  • 2010

Abstract

We describe a component of a document analysis system for constructing ontologies for domain-specific web tables imported into Excel. This component automates extraction of the Wang notation for the column header of a table. Using column-header-specific rules for XY cutting, we convert the geometric structure of the column header to a linear string denoting cell attributes and directions of cuts. The string representation is parsed by a context-free grammar, and the parse tree is further processed to produce an abstract data-type representation (the Wang notation tree) of each column category. Experiments were carried out to evaluate this scheme on the original and edited column headers of Excel tables drawn from a collection of 200 used in our earlier work. The edited headers were obtained by reformatting the original column headers to conform to the format targeted by our grammar. Forty-four original headers and their reformatted versions were submitted as input to our software system. Our grammar was able to parse and extract the Wang notation tree for all the edited headers, but for only four of the original headers. We suggest extensions to our table grammar that would enable processing a larger fraction of headers without manual editing.
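
As a rough illustration of how a header's geometry can be flattened into a string of cuts, the following minimal Python sketch applies a generic recursive XY cut to a small header grid. It is not the authors' rule set or string format: the function name `xy_cut`, the `V(...)`/`H(...)` notation, and the convention that a merged cell is represented by repeating its label in every grid position it covers are all assumptions made for this example.

```python
def xy_cut(grid, r0=0, r1=None, c0=0, c1=None):
    """Recursively cut a header region and return a linear string.

    Assumption: `grid` is a rectangular list of rows, and a merged cell
    is encoded by repeating its label in each position it spans, so two
    adjacent equal labels are treated as one cell.
    """
    if r1 is None:
        r1 = len(grid)
    if c1 is None:
        c1 = len(grid[0])

    # A region covered by a single cell is a leaf of the cut tree.
    labels = {grid[r][c] for r in range(r0, r1) for c in range(c0, c1)}
    if len(labels) == 1:
        return labels.pop()

    # Vertical cuts: column boundaries that no merged cell crosses.
    v_cuts = [c for c in range(c0 + 1, c1)
              if all(grid[r][c - 1] != grid[r][c] for r in range(r0, r1))]
    if v_cuts:
        bounds = [c0, *v_cuts, c1]
        parts = [xy_cut(grid, r0, r1, a, b)
                 for a, b in zip(bounds, bounds[1:])]
        return "V(" + ",".join(parts) + ")"

    # Horizontal cuts: row boundaries that no merged cell crosses.
    h_cuts = [r for r in range(r0 + 1, r1)
              if all(grid[r - 1][c] != grid[r][c] for c in range(c0, c1))]
    if h_cuts:
        bounds = [r0, *h_cuts, r1]
        parts = [xy_cut(grid, a, b, c0, c1)
                 for a, b in zip(bounds, bounds[1:])]
        return "H(" + ",".join(parts) + ")"

    # Layouts that admit neither kind of cut are outside this sketch.
    raise ValueError("region cannot be split by an XY cut")


# Hypothetical two-row header: "Year" spans two columns above
# "2009" and "2010"; "Region" spans both rows.
header = [
    ["Year", "Year", "Region"],
    ["2009", "2010", "Region"],
]
print(xy_cut(header))
```

In this toy encoding, an `H(parent,V(children...))` group reads as a category with its subcategories, which is the kind of nesting a grammar could then turn into a category tree. Headers that cannot be decomposed by such cuts raise an error here.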