Wrapper Generation via Grammar Induction
ECML '00 Proceedings of the 11th European Conference on Machine Learning
Modern agent and mediator systems communicate with a multitude of Web information providers to better satisfy user requests. They use wrappers to extract relevant information from HTML responses and to annotate it with user-defined labels. A number of approaches exploit machine learning methods to induce instances of certain wrapper classes, assuming a tabular structure of HTML responses and observing the regularity of extracted fragments in the HTML structure. In this work, we propose a general approach and consider the information extraction conducted by wrappers as a special form of transduction. We make no assumption about the HTML response structure and draw on advanced methods of transducer induction to develop two powerful wrapper classes, for samples with and without ambiguous translations. We test the proposed induction methods on a set of general-purpose and bibliographic data providers and report the results of experiments.
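
To make the view of a wrapper as a transducer concrete, the following minimal sketch reads an HTML response token by token and emits labeled text fragments. It is only an illustration under stated assumptions: the transition table, the state names, the labels `title` and `author`, and the functions `tokenize` and `transduce` are hypothetical and written by hand here, whereas the approach described above induces the transducer from annotated samples.

```python
import re

# Hypothetical, hand-written transition table: (state, tag token) -> next state.
# In the approach described above, such a table would be induced from samples.
TRANSITIONS = {
    ("start", "<li>"): "item",
    ("item", "<b>"): "in_title",
    ("in_title", "</b>"): "item",
    ("item", "<i>"): "in_author",
    ("in_author", "</i>"): "item",
    ("item", "</li>"): "start",
}

# Hypothetical output function: label emitted for text tokens read in a state.
LABELS = {"in_title": "title", "in_author": "author"}


def tokenize(html):
    """Split an HTML string into tag tokens and non-empty text tokens."""
    parts = re.split(r"(<[^>]+>)", html)
    return [p.strip() for p in parts if p.strip()]


def transduce(html):
    """Run the transducer and return a list of (label, text) pairs."""
    state = "start"
    output = []
    for token in tokenize(html):
        if token.startswith("<"):
            # Tag tokens drive state changes; unknown tags keep the state.
            state = TRANSITIONS.get((state, token.lower()), state)
        else:
            # Text tokens are emitted only in states that carry a label.
            label = LABELS.get(state)
            if label is not None:
                output.append((label, token))
    return output


if __name__ == "__main__":
    page = "<ul><li><b>Some Title</b> by <i>A. Author</i></li></ul>"
    print(transduce(page))
```

The sketch skips text read in unlabeled states (such as the word "by" in the usage example), which mirrors the idea that a wrapper translates an input token sequence into an output sequence of user-defined labels and ignores everything else.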