Many documents on the Web are formatted in a weakly structured way. Because of their weak semantics and the heterogeneity of their formats, the information conveyed by their structure cannot be directly exploited. We consider here the conversion of such documents into a predefined, mediated, semi-structured format that is more amenable to automatic processing of the document content. We develop a machine learning approach to this conversion problem in which the transformation is learned automatically from a set of example documents that have been manually transformed into the target structure. Our method proceeds in three steps. Given an input document, document elements are first annotated with labels of the target schema. Structured candidate documents are then generated using a generalized probabilistic context-free parsing algorithm. Finally, the candidates are reranked using a perceptron-like ranking algorithm. Experiments performed on two different datasets show that the proposed method performs well in different contexts.
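To make the third step more concrete, here is a minimal sketch of a perceptron-style reranker. It is an illustration under stated assumptions, not the authors' algorithm: it assumes each candidate document has already been turned into a sparse feature dictionary, and the names `score`, `rerank`, `train_reranker`, and the additive update rule are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation: one sparse feature vector per candidate document.
Features = Dict[str, float]
Example = Tuple[Sequence[Features], int]  # (candidates, index of the correct one)


def score(weights: Dict[str, float], feats: Features) -> float:
    """Linear score of one candidate under the current weights."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def rerank(weights: Dict[str, float], candidates: Sequence[Features]) -> int:
    """Return the index of the highest-scoring candidate (first one on ties)."""
    return max(range(len(candidates)), key=lambda i: score(weights, candidates[i]))


def train_reranker(examples: List[Example], epochs: int = 10) -> Dict[str, float]:
    """Perceptron-style training: on each ranking mistake, move the weights
    toward the correct candidate and away from the wrongly preferred one."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        mistakes = 0
        for candidates, gold in examples:
            predicted = rerank(weights, candidates)
            if predicted == gold:
                continue
            mistakes += 1
            for name, value in candidates[gold].items():
                weights[name] = weights.get(name, 0.0) + value
            for name, value in candidates[predicted].items():
                weights[name] = weights.get(name, 0.0) - value
        if mistakes == 0:  # every training example is ranked correctly
            break
    return weights
```

In a pipeline like the one described above, the candidates would come from the parsing step, and the features might include the parser's log-probability together with structural indicators computed on each candidate tree. Those feature choices are assumptions of this sketch.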