Incremental learning of system log formats
ACM SIGOPS Operating Systems Review
An ad hoc data source is any semi-structured, non-standard data source. The format of such data sources is often evolving and frequently lacking documentation. Consequently, off-the-shelf tools for processing such data often do not exist, forcing analysts to develop their own tools, a costly and time-consuming process. In this paper, we present an incremental algorithm that automatically infers the format of large-scale data sources. From the resulting format descriptions, we can generate a suite of data processing tools automatically. The system can handle large-scale or streaming data sources whose formats evolve over time. Furthermore, it allows analysts to modify inferred descriptions as desired and incorporates those changes in future revisions.