For a document collection in which structural elements are identified with markup, it is often necessary to construct a grammar retrospectively that constrains element nesting and ordering. This has been addressed by others as an application of grammatical inference. We describe an approach based on stochastic grammatical inference which scales more naturally to large data sets and produces models with richer semantics. We adopt an algorithm that produces stochastic finite automata and describe modifications that enable better interactive control of results. Our experimental evaluation uses four document collections with varying structure.
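
The abstract does not name the specific inference algorithm, so the following is only a minimal sketch of a starting point commonly used by state-merging approaches to stochastic grammatical inference: a frequency-annotated prefix tree built from the child-element sequences observed under one element type. The function names (`build_prefix_tree`, `transition_probability`) and the sample element names (`title`, `para`, `list`) are hypothetical and chosen for illustration; the state-merging step and the modifications for interactive control mentioned above are not shown.

```python
from collections import defaultdict


def build_prefix_tree(sequences):
    """Build a frequency prefix tree from child-element sequences.

    State 0 is the root. Returns three mappings:
      transitions:  (state, symbol) -> (next_state, count)
      final_counts: state -> number of sequences that end in that state
      visit_counts: state -> number of sequences that pass through that state
    """
    transitions = {}
    final_counts = defaultdict(int)
    visit_counts = defaultdict(int)
    next_id = 1
    for seq in sequences:
        state = 0
        visit_counts[state] += 1
        for symbol in seq:
            key = (state, symbol)
            if key in transitions:
                # Existing edge: reuse its target and bump the count.
                target, count = transitions[key]
                transitions[key] = (target, count + 1)
            else:
                # New edge: allocate a fresh state.
                target = next_id
                next_id += 1
                transitions[key] = (target, 1)
            state = target
            visit_counts[state] += 1
        final_counts[state] += 1
    return transitions, final_counts, visit_counts


def transition_probability(transitions, visit_counts, state, symbol):
    """Relative frequency of taking `symbol` out of `state` (0.0 if unseen)."""
    entry = transitions.get((state, symbol))
    if entry is None:
        return 0.0
    return entry[1] / visit_counts[state]


# Hypothetical child sequences observed under one element type.
samples = [
    ["title", "para"],
    ["title", "para", "para"],
    ["title", "list"],
    ["para"],
]
transitions, final_counts, visit_counts = build_prefix_tree(samples)
p_title = transition_probability(transitions, visit_counts, 0, "title")
```

In this sketch the counts attached to each state and edge are what make the model stochastic: a merging procedure would compare such frequencies to decide whether two states behave alike, and the same counts let a user see how much of the collection supports each part of the resulting grammar.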