A context-free markup language for semi-structured text

Authors:
Qian Xi; David Walker
Affiliations:
Princeton University, Princeton, NJ, USA; Princeton University, Princeton, NJ, USA
Venue:
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Year:
2010


Abstract

An ad hoc data format is any nonstandard, semi-structured data format for which robust data processing tools are not easily available. In this paper, we present ANNE, a new kind of markup language designed to help users generate documentation and data processing tools for ad hoc text data. More specifically, given a new ad hoc data source, an ANNE programmer edits the document to add a number of simple annotations, which serve to specify its syntactic structure. Annotations include elements that specify constants, optional data, alternatives, enumerations, sequences, tabular data, and recursive patterns.

The ANNE system uses a combination of user annotations and the raw data itself to extract a context-free grammar from the document. This context-free grammar can then be used to parse the data and transform it into an XML parse tree, which may be viewed through a browser for analysis or debugging purposes. The ANNE system also generates a PADS/ML description, which may be saved as lasting documentation of the data format or compiled into a host of useful data processing tools.

Beyond designing and implementing ANNE, we have devised a semantic theory for the core elements of the language. This semantic theory describes the editing process, which translates a raw, unannotated text document into an annotated document, and the grammar extraction process, which generates a context-free grammar from an annotated document. We also present an alternative characterization of system behavior by drawing upon ideas from the field of relevance logic. This secondary characterization, which we call relevance analysis, specifies a direct relationship between unannotated documents and the context-free grammars that our system can generate from them. Relevance analysis allows us to prove important theorems concerning the expressiveness and utility of our system.
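
To make the grammar extraction idea concrete, the following is a minimal sketch in Python of how annotations combined with raw data might yield a context-free grammar. It is not ANNE's implementation or syntax, neither of which the abstract shows. The node types (`Const`, `Field`, `Opt`, `Seq`), the `token_class` heuristic, and the sample log entry are all hypothetical names and data chosen for illustration, and only four of the annotation kinds listed above are modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Hypothetical in-memory annotation tree. ANNE's real annotation
# syntax is not given in the abstract; these classes are assumptions.

@dataclass
class Const:
    text: str            # literal text that must appear verbatim

@dataclass
class Field:
    name: str            # nonterminal name chosen by the annotator
    sample: str          # the raw text that was annotated

@dataclass
class Opt:
    name: str
    body: "Node"         # data that may be absent

@dataclass
class Seq:
    name: str
    items: List["Node"]  # items that appear one after another

Node = Union[Const, Field, Opt, Seq]
Grammar = Dict[str, List[List[str]]]  # nonterminal -> alternatives

def token_class(sample: str) -> str:
    """Guess a terminal class from the raw data (a crude stand-in)."""
    if sample.isdigit():
        return "INT"
    if sample.isalpha():
        return "WORD"
    return "STRING"

def extract(node: Node, grammar: Grammar) -> str:
    """Return the grammar symbol for node, recording its productions."""
    if isinstance(node, Const):
        return repr(node.text)               # quoted terminal
    if isinstance(node, Field):
        grammar[node.name] = [[token_class(node.sample)]]
        return node.name
    if isinstance(node, Opt):
        inner = extract(node.body, grammar)
        grammar[node.name] = [[inner], []]   # body or empty
        return node.name
    if isinstance(node, Seq):
        symbols = [extract(item, grammar) for item in node.items]
        grammar[node.name] = [symbols]
        return node.name
    raise TypeError(f"unknown annotation node: {node!r}")

def show(grammar: Grammar) -> str:
    """Render the grammar in a BNF-like notation."""
    lines = []
    for lhs, alternatives in grammar.items():
        rhs = " | ".join(" ".join(alt) or "<empty>" for alt in alternatives)
        lines.append(f"{lhs} ::= {rhs}")
    return "\n".join(lines)

# A made-up annotated fragment standing in for text such as "2010-ERROR 42".
entry = Seq("entry", [
    Field("year", "2010"),
    Const("-"),
    Field("level", "ERROR"),
    Opt("maybe_code", Field("code", "42")),
])

grammar: Grammar = {}
start_symbol = extract(entry, grammar)
print(show(grammar))
```

In this sketch the annotations supply the shape of the grammar (sequence, optional, constant), while the raw samples supply the terminal classes, mirroring in a very reduced form the abstract's statement that both sources of information are used.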