An ad hoc data format is any nonstandard, semi-structured data format for which robust data processing tools are not easily available. In this paper, we present ANNE, a new kind of markup language designed to help users generate documentation and data processing tools for ad hoc text data. Given a new ad hoc data source, an ANNE programmer edits the document to add a number of simple annotations that specify its syntactic structure. Annotations include elements that specify constants, optional data, alternatives, enumerations, sequences, tabular data, and recursive patterns. The ANNE system uses a combination of the user annotations and the raw data itself to extract a context-free grammar from the document. This grammar can then be used to parse the data and transform it into an XML parse tree, which may be viewed through a browser for analysis or debugging purposes. The ANNE system also generates a PADS/ML description, which may be saved as lasting documentation of the data format or compiled into a host of useful data processing tools.

In addition to designing and implementing ANNE, we have devised a semantic theory for the core elements of the language. This semantic theory describes two processes: the editing process, which translates a raw, unannotated text document into an annotated document, and the grammar extraction process, which generates a context-free grammar from an annotated document. We also present an alternative characterization of system behavior that draws upon ideas from the field of relevance logic. This secondary characterization, which we call relevance analysis, specifies a direct relationship between unannotated documents and the context-free grammars that our system can generate from them. Relevance analysis allows us to prove important theorems concerning the expressiveness and utility of our system.
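
The step from an annotated document to a context-free grammar can be illustrated with a toy sketch. This is not ANNE's actual annotation syntax or extraction algorithm, neither of which is given here. The node classes `Const`, `Opt`, `Alt`, and `Seq` and the function `extract_grammar` are hypothetical names chosen for illustration, and they cover only four of the annotation kinds listed above (constants, optional data, alternatives, sequences). The sketch assumes the annotations have already been read into a tree and emits one nonterminal per non-constant node.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Const:
    """Hypothetical annotation: a literal piece of text."""
    text: str


@dataclass
class Opt:
    """Hypothetical annotation: data that may be absent."""
    body: "Node"


@dataclass
class Alt:
    """Hypothetical annotation: exactly one of several choices."""
    choices: List["Node"]


@dataclass
class Seq:
    """Hypothetical annotation: items appearing one after another."""
    items: List["Node"]


Node = Union[Const, Opt, Alt, Seq]
# A grammar maps each nonterminal to a list of productions;
# each production is a list of symbols (an empty list is epsilon).
Grammar = Dict[str, List[List[str]]]


def extract_grammar(root: Node) -> Tuple[str, Grammar]:
    """Return (start symbol, grammar) for a tree of toy annotations."""
    rules: Grammar = {}
    counter = 0

    def fresh() -> str:
        nonlocal counter
        counter += 1
        return f"N{counter}"

    def visit(node: Node) -> str:
        if isinstance(node, Const):
            # Terminals are written as quoted strings.
            return repr(node.text)
        name = fresh()
        if isinstance(node, Seq):
            rules[name] = [[visit(item) for item in node.items]]
        elif isinstance(node, Alt):
            rules[name] = [[visit(choice)] for choice in node.choices]
        elif isinstance(node, Opt):
            # Either the body or the empty production.
            rules[name] = [[visit(node.body)], []]
        else:
            raise TypeError(f"unknown annotation: {node!r}")
        return name

    return visit(root), rules


if __name__ == "__main__":
    doc = Seq([Const("GET "), Alt([Const("/a"), Const("/b")]), Opt(Const("?x"))])
    start, grammar = extract_grammar(doc)
    print("start:", start)
    for lhs in sorted(grammar):
        for rhs in grammar[lhs]:
            print(lhs, "->", " ".join(rhs) if rhs else "ε")
```

One nonterminal per annotation node keeps the mapping from annotations to productions direct, at the cost of a larger grammar than a hand-written one would need.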