Ad Hoc Data and the Token Ambiguity Problem

Authors:
Qian Xi;Kathleen Fisher;David Walker;Kenny Q. Zhu
Affiliations:
Princeton University; AT&T Research; Princeton University; Princeton University
Venue:
PADL '09 Proceedings of the 11th International Symposium on Practical Aspects of Declarative Languages
Year:
2009

Abstract

PADS is a declarative language for describing the syntax and semantic properties of ad hoc data sources such as financial transactions, server logs, and scientific data sets. The PADS compiler reads these descriptions and generates a suite of data processing tools, including format translators, parsers, printers, and even a query engine, all customized to the ad hoc data format in question. Recently, to further improve the productivity of programmers who manage ad hoc data sources, we have turned to using PADS as an intermediate language in a system that first infers a PADS description directly from example data and then passes that description to the original compiler for tool generation. A key subproblem in the inference engine is the token ambiguity problem: determining which substrings in the example data correspond to complex tokens such as dates, URLs, or comments. To solve the token ambiguity problem, the paper studies the relative effectiveness of three different statistical models for tokenizing ad hoc data. It also shows how to incorporate these models into a general and effective format inference algorithm. In addition to using a declarative language (PADS) as a key intermediate form, we have implemented the system as a whole in ML.
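
To make the token ambiguity problem concrete, the following minimal Python sketch enumerates every way a short string can be split into tokens. It is an illustration only, not the paper's method: the token names and regular expressions in `TOKEN_PATTERNS` and the helper `tokenizations` are hypothetical, and the statistical models studied in the paper are not reproduced here. As a simplifying assumption, each token type contributes only its longest match at a given position.

```python
import re

# Hypothetical token definitions, chosen for illustration only.
# A real inference system would use a much richer token set.
TOKEN_PATTERNS = {
    "date": re.compile(r"\d{1,2}/\d{1,2}/\d{4}"),
    "int": re.compile(r"\d+"),
    "punct": re.compile(r"[/:.\-]"),
}


def tokenizations(text, start=0):
    """Yield every split of text[start:] into (token_name, lexeme) pairs.

    Simplifying assumption: each token type offers only its longest
    match at the current position.
    """
    if start == len(text):
        yield []
        return
    for name, pattern in TOKEN_PATTERNS.items():
        match = pattern.match(text, start)
        if match and match.end() > start:
            for rest in tokenizations(text, match.end()):
                yield [(name, match.group())] + rest


if __name__ == "__main__":
    for candidate in tokenizations("12/31/2008"):
        print(candidate)
```

Under these assumed patterns, the string `12/31/2008` admits two readings: a single `date` token, or a sequence of three `int` tokens separated by `punct` tokens. Choosing between such readings is the decision that a tokenization model has to make.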