Web wrapper induction: a brief survey

Authors:
Sergio Flesca;Giuseppe Manco;Elio Masciari;Eugenio Rende;Andrea Tagarelli
Affiliations:
Sergio Flesca: DEIS, University of Calabria, 87030 Rende, Italy. E-mail: flesca@si.deis.unical.it
Giuseppe Manco: ICAR-CNR, 87030 Rende, Italy. E-mail: manco@icar.cnr.it
Elio Masciari: ICAR-CNR, 87030 Rende, Italy. E-mail: masciari@icar.cnr.it
Eugenio Rende: DEIS, University of Calabria, 87030 Rende, Italy. E-mail: erende@si.deis.unical.it
Andrea Tagarelli: DEIS, University of Calabria, 87030 Rende, Italy. E-mail: tagarelli@si.deis.unical.it
Acknowledgement: S. Flesca and E. Rende are partially supported by the EU project Infomix and by Lixto Software GmbH.
Venue:
AI Communications
Year:
2004

Abstract

Many companies now use the information available on the Web for a variety of purposes. Because most of this information is available only as HTML documents, several techniques for extracting it automatically have recently been proposed. In this paper we review the main techniques and tools for extracting information from the Web and devise a taxonomy of existing systems. In particular, we emphasize the advantages and drawbacks of the analyzed techniques from a user's point of view.