We address the problem of automatically learning to map heterogeneous semi-structured documents onto a mediated target XML schema. We adopt a machine learning approach in which the mapping between input and target documents is learned from a training corpus of documents. We first introduce a general stochastic model of semi-structured document generation and transformation. This model relies on the concept of a meta-document, a latent variable that links input and target documents. It allows us to learn the correspondences even when the input documents are expressed in a large variety of schemas. We then detail an instance of the general model for the particular task of HTML-to-XML conversion. This instance is tested on three different corpora using two different inference methods: a dynamic programming method and an approximate LaSO-based method.
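To give a feel for the dynamic programming inference mentioned above, here is a minimal sketch. It is not the model evaluated in this work. It assumes a much simpler setting in which the leaves of an HTML document are labeled left to right with target XML element names by a first-order chain model. The label set, the probability tables, and the names `viterbi` and `log_table` are all hypothetical and chosen only for illustration.

```python
import math


def log_table(table):
    """Convert a nested dict of probabilities into log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


def viterbi(observations, labels, log_start, log_trans, log_emit):
    """Return the highest-scoring label sequence under a first-order chain model."""
    if not observations:
        return []
    # best[i][y]: best log-score of any labeling of observations[:i + 1] ending in y
    best = [{y: log_start[y] + log_emit[y][observations[0]] for y in labels}]
    back = []  # back[i][y]: best predecessor of label y at position i + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + log_trans[p][y])
            scores[y] = best[-1][prev] + log_trans[prev][y] + log_emit[y][obs]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy model: target XML elements and the HTML tags that emit them.
LABELS = ["title", "author", "para"]
LOG_START = {y: math.log(p) for y, p in
             {"title": 0.8, "author": 0.1, "para": 0.1}.items()}
LOG_TRANS = log_table({
    "title": {"title": 0.1, "author": 0.7, "para": 0.2},
    "author": {"title": 0.05, "author": 0.25, "para": 0.7},
    "para": {"title": 0.05, "author": 0.05, "para": 0.9},
})
LOG_EMIT = log_table({
    "title": {"h1": 0.8, "i": 0.1, "p": 0.1},
    "author": {"h1": 0.1, "i": 0.7, "p": 0.2},
    "para": {"h1": 0.05, "i": 0.15, "p": 0.8},
})

# Sequence of HTML leaf tags to be mapped onto target XML elements.
html_leaves = ["h1", "i", "p", "p"]
xml_labels = viterbi(html_leaves, LABELS, LOG_START, LOG_TRANS, LOG_EMIT)
print(list(zip(html_leaves, xml_labels)))
```

Exact dynamic programming of this kind is tractable because each label depends only on its predecessor. An approximate search-based method such as LaSO trades that exactness for the freedom to use richer, non-local features.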