A probabilistic learning method for XML annotation of documents

Authors:
Boris Chidlovskii; Jérôme Fuselier
Affiliations:
Xerox Research Centre Europe, Meylan, France; Xerox Research Centre Europe, Meylan, France and Université de Savoie, Laboratoire SysCom, Domaine Universitaire, Le Bourget-du-Lac, France
Venue:
IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence
Year:
2005

Abstract

We consider the problem of semantic annotation of semi-structured documents according to a target XML schema. The task is to annotate a document in a tree-like manner, where the annotation tree is an instance of the tree class defined by a DTD or W3C XML Schema description. In the probabilistic setting, we cast tree annotation as generalized probabilistic context-free parsing of an observation sequence, in which each observation comes with a probability distribution over terminals supplied by a probabilistic classifier applied to the document content. We determine the most probable tree annotation by maximizing the joint probability of selecting a terminal sequence for the observation sequence and of the most probable parse for the selected terminal sequence.
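
The joint maximization described in the abstract can be illustrated with a small Viterbi-style CYK sketch. This is not the authors' implementation. It assumes, for illustration only, that the schema has already been converted to a probabilistic grammar in Chomsky normal form, and the function name `best_annotation`, the element names (`Doc`, `Title`, `Body`, `Para`), and all probabilities are hypothetical. The only change from ordinary CYK is in the base case, where each chart cell weighs the classifier's distribution over terminals against the lexical rule probabilities instead of reading a fixed terminal.

```python
from collections import defaultdict


def best_annotation(term_dists, lexical, binary, start):
    """Viterbi CYK over uncertain terminals (illustrative sketch, CNF assumed).

    term_dists: one dict per observation, terminal -> p(terminal | observation)
    lexical:    dict (A, t) -> P(A -> t)
    binary:     dict (A, B, C) -> P(A -> B C)
    Returns (probability, tree); (0.0, None) when no parse exists.
    """
    n = len(term_dists)
    chart = defaultdict(dict)  # chart[(i, j)][A] = (best prob, backpointer)

    # Base case: choose the terminal jointly with the lexical rule.
    for i, dist in enumerate(term_dists):
        cell = chart[(i, i + 1)]
        for (a, t), rule_p in lexical.items():
            p = dist.get(t, 0.0) * rule_p
            if p > cell.get(a, (0.0, None))[0]:
                cell[a] = (p, t)

    # Recursive case: standard max-product combination of adjacent spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = chart[(i, j)]
            for k in range(i + 1, j):
                for (a, b, c), rule_p in binary.items():
                    left = chart[(i, k)].get(b)
                    right = chart[(k, j)].get(c)
                    if left is None or right is None:
                        continue
                    p = rule_p * left[0] * right[0]
                    if p > cell.get(a, (0.0, None))[0]:
                        cell[a] = (p, (k, b, c))

    if start not in chart[(0, n)]:
        return 0.0, None

    def build(i, j, a):
        _, back = chart[(i, j)][a]
        if j - i == 1:  # width-1 cells only ever hold lexical entries
            return (a, back)
        k, b, c = back
        return (a, build(i, k, b), build(k, j, c))

    return chart[(0, n)][start][0], build(0, n, start)


# Hypothetical toy schema: a Doc is a Title followed by one or more paragraphs.
lexical = {("Title", "title"): 1.0, ("Para", "para"): 1.0, ("Body", "para"): 0.6}
binary = {("Doc", "Title", "Body"): 1.0, ("Body", "Para", "Body"): 0.4}

# Hypothetical classifier outputs for three text fragments. Taken alone, the
# classifier would label the second fragment "title".
observations = [
    {"title": 0.7, "para": 0.3},
    {"title": 0.6, "para": 0.4},
    {"title": 0.1, "para": 0.9},
]

prob, tree = best_annotation(observations, lexical, binary, start="Doc")
print(prob, tree)
```

In this toy run the second fragment is labeled `para` even though the classifier alone prefers `title`, because no tree in the hypothetical grammar admits two titles. That is the effect of scoring the terminal choice and the parse jointly.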