From layout to semantic: a reranking model for mapping web documents to mediated XML representations

Authors:
Guillaume Wisniewski;Patrick Gallinari
Affiliations:
LIP6 --- UPMC, Paris, France;LIP6 --- UPMC, Paris, France
Venue:
Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
Year:
2007

Abstract

Many documents on the Web are formatted in weakly structured formats. Because of their weak semantics and the heterogeneity of their formats, the information conveyed by their structure cannot be exploited directly. We consider the conversion of such documents into a predefined, mediated semi-structured format that is more amenable to automatic processing of the document content. We develop a machine learning approach to this conversion problem in which the transformation is learned automatically from a set of example documents manually transformed into the target structure. Our method proceeds in three steps. Given an input document, document elements are first annotated with labels of the target schema. Structured candidate documents are then generated using a generalized probabilistic context-free parsing algorithm. Finally, candidates are reranked using a perceptron-like ranking algorithm. Experiments performed on two different datasets show that the proposed method performs well in different contexts.
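
The third step, reranking candidates with a perceptron-like algorithm, can be illustrated with a minimal sketch. The abstract does not specify the features or the exact update rule, so everything below is an assumption made for illustration: candidate trees are taken as given (in the paper they would come from the parsing step), the `features` function counting parent>child label pairs is hypothetical, and the update is the standard perceptron reranking rule rather than the authors' exact algorithm.

```python
from collections import Counter

def features(tree):
    """Hypothetical feature map: counts of parent>child label pairs.
    A tree is (label, [children]); a leaf has an empty child list."""
    feats = Counter()
    stack = [tree]
    while stack:
        label, children = stack.pop()
        for child in children:
            feats[f"{label}>{child[0]}"] += 1
            stack.append(child)
    return feats

def score(weights, feats):
    """Linear score: dot product of the weight vector and the feature counts."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def rerank(weights, candidates):
    """Return the highest-scoring candidate (first one wins on ties)."""
    return max(candidates, key=lambda c: score(weights, features(c)))

def train(examples, epochs=5):
    """Perceptron training. examples: list of (candidates, gold_index) pairs.
    On each mistake, move the weights toward the gold candidate's features
    and away from the features of the wrongly preferred candidate."""
    weights = {}
    for _ in range(epochs):
        for candidates, gold_index in examples:
            gold = candidates[gold_index]
            predicted = rerank(weights, candidates)
            if predicted is not gold:
                for name, value in features(gold).items():
                    weights[name] = weights.get(name, 0.0) + value
                for name, value in features(predicted).items():
                    weights[name] = weights.get(name, 0.0) - value
    return weights

# Toy data (invented): two candidate structures for the same document.
good = ("movie", [("title", []), ("cast", [("actor", []), ("actor", [])])])
bad = ("movie", [("title", []), ("actor", []), ("actor", [])])

weights = train([([bad, good], 1)])
best = rerank(weights, [bad, good])
```

After training on this single toy example, `rerank` prefers the candidate that groups the actors under a `cast` element. A real system would use richer features and many training documents.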