A brief survey of web data extraction tools

Authors:
Alberto H. F. Laender; Berthier A. Ribeiro-Neto; Altigran S. da Silva; Juliana S. Teixeira
Affiliations:
Federal University of Minas Gerais, Belo Horizonte, MG, Brazil (all four authors)
Venue:
ACM SIGMOD Record
Year:
2002

Abstract

In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval, databases, and ontologies. As a consequence, they present very distinct features and capabilities, which makes a direct comparison difficult. In this paper, we propose a taxonomy for characterizing Web data extraction tools, briefly survey major Web data extraction tools described in the literature, and provide a qualitative analysis of them. Hopefully, this work will stimulate other studies aimed at a more comprehensive analysis of data extraction approaches and tools for Web data.
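
As an illustration of the remark that extracted data can be handled much like database instances, the following is a minimal sketch of a hand-written wrapper in Python. It is not taken from any of the surveyed tools: the page fragment, the tag-to-field mapping `FIELD_BY_TAG`, and the names `ListWrapper` and `extract_records` are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the markup and values are assumptions.
PAGE = """
<ul>
  <li><b>Web Wrappers</b> <i>A. Author</i> <span>19.90</span></li>
  <li><b>Data on the Web</b> <i>B. Writer</i> <span>42.00</span></li>
</ul>
"""

# Assumed mapping from HTML tag to record field for this one page layout.
FIELD_BY_TAG = {"b": "title", "i": "author", "span": "price"}


class ListWrapper(HTMLParser):
    """Hand-written wrapper: one record per <li>, one field per mapped tag."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._record = None  # record being built, or None outside <li>
        self._field = None   # field currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._record = {}
        elif self._record is not None and tag in FIELD_BY_TAG:
            self._field = FIELD_BY_TAG[tag]

    def handle_data(self, data):
        # Keep text only while inside a mapped tag of the current record.
        if self._record is not None and self._field is not None:
            self._record[self._field] = self._record.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "li" and self._record is not None:
            self.records.append(self._record)
            self._record = None
        elif tag in FIELD_BY_TAG and FIELD_BY_TAG[tag] == self._field:
            self._field = None


def extract_records(html):
    """Run the wrapper over one page and return a list of dict records."""
    parser = ListWrapper()
    parser.feed(html)
    parser.close()
    return parser.records


records = extract_records(PAGE)
# Once extracted, the records can be filtered like rows of a table.
cheap = [r["title"] for r in records if float(r["price"]) < 20]
```

A wrapper written this way is tied to a single page layout, so any change to the markup requires editing it by hand.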