Learning Information Extraction Rules for Semi-Structured and Free Text
Machine Learning - Special issue on natural language learning
Inference of Reversible Languages
Journal of the ACM (JACM)
XTRACT: a system for extracting document type descriptors from XML documents
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Automatic segmentation of text into structured records
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Grammatical Inference: An Introduction Survey
ICGI '94 Proceedings of the Second International Colloquium on Grammatical Inference and Applications
Inducing Probabilistic Grammars by Bayesian Model Merging
ICGI '94 Proceedings of the Second International Colloquium on Grammatical Inference and Applications
Potter's Wheel: An Interactive Data Cleaning System
Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
The Rufus System: Information Organization for Semi-Structured Data
VLDB '93 Proceedings of the 19th International Conference on Very Large Data Bases
Table extraction using conditional random fields
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Wrapper induction for information extraction
Wrapper induction for information extraction
Learning regular languages using RFSAs
Theoretical Computer Science - Special issue: Algorithmic learning theory
Using the structure of Web sites for automatic segmentation of tables
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Learning to recognize tables in free text
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
PADS: a domain-specific language for processing ad hoc data
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
The next 700 data description languages
Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Inference of concise DTDs from XML data
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Expressiveness and complexity of XML Schema
ACM Transactions on Database Systems (TODS)
PADS/ML: a functional data description language
Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Inferring XML schema definitions from XML data
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Finding structure via compression
NeMLaP3/CoNLL '98 Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning
Active learning with strong and weak views: a case study on wrapper induction
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Learning (k,l)-contextual tree languages for information extraction
ECML'05 Proceedings of the 16th European conference on Machine Learning
LearnPADS: automatic tool generation from ad hoc data
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Tupni: automatic reverse engineering of input formats
Proceedings of the 15th ACM conference on Computer and communications security
Ad Hoc Data and the Token Ambiguity Problem
PADL '09 Proceedings of the 11th International Symposium on Practical Aspects of Declarative Languages
Bidirectional Transformations: A Cross-Discipline Perspective
ICMT '09 Proceedings of the 2nd International Conference on Theory and Practice of Model Transformations
Detecting large-scale system problems by mining console logs
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Semantics and algorithms for data-dependent grammars
Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
SherLog: error diagnosis by connecting clues from run-time logs
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Incremental learning of system log formats
ACM SIGOPS Operating Systems Review
A context-free markup language for semi-structured text
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Reverse engineering for mobile systems forensics with Ares
Proceedings of the 2010 ACM workshop on Insider threats
Automating string processing in spreadsheets using input-output examples
Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
A graphical representation for identifier structure in logs
SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Experience mining Google's production console logs
SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Inferring meta-models for runtime system data from the clients of management APIs
MODELS'10 Proceedings of the 13th international conference on Model driven engineering languages and systems: Part II
Proceedings of the 14th International Conference on Database Theory
Spreadsheet table transformations from examples
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Forensic triage for mobile phones with DEC0DE
SEC'11 Proceedings of the 20th USENIX conference on Security
A library for processing ad hoc data in Haskell: embedding a data description language
IFL'08 Proceedings of the 20th international conference on Implementation and application of functional languages
LearnPADS++: incremental inference of ad hoc data formats
PADL'12 Proceedings of the 14th international conference on Practical Aspects of Declarative Languages
A system for extracting top-K lists from the web
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Unsupervised discovery and extraction of semi-structured regions in text via self-information
Proceedings of the 2013 workshop on Automated knowledge base construction
The operating system: should there be one?
Proceedings of the Seventh Workshop on Programming Languages and Operating Systems
Hi-index | 0.02 |
An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semi-structured query engine, several format converters, a statistical analyzer and data visualization routines directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multi-phase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler, creating format-dependent modules that, when linked with format-independent algorithms for analysis and transformation, result infully functional tools. We evaluate the performance of our inference algorithm, showing it scales linearlyin the size of the training data - completing in seconds, as opposed to the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of theavailable data.