Site-Wide Wrapper Induction for Life Science Deep Web Databases

Authors:
Saqib Mir;Steffen Staab;Isabel Rojas
Affiliations:
EML Research, Heidelberg, Germany D-69118 and Institute for Computer Science, University of Koblenz-Landau, Koblenz, Germany D-56016;Institute for Computer Science, University of Koblenz-Landau, Koblenz, Germany D-56016;EML Research, Heidelberg, Germany D-69118
Venue:
DILS '09 Proceedings of the 6th International Workshop on Data Integration in the Life Sciences
Year:
2009

Abstract

We present a novel approach to automatic information extraction from Deep Web Life Science databases using wrapper induction. Traditional wrapper induction techniques learn wrappers from examples drawn from a single class of Web pages, i.e., pages that are all similar in structure and content. Traditional wrapper induction therefore targets only Web pages generated from a database with the same generation template as the one observed in the example set. Life Science Web sites, however, typically contain structurally diverse Web pages from multiple classes, which makes the problem more challenging. Furthermore, we observed that such Web sites do not provide data alone: they also tend to provide schema information in the form of data labels, which gives further cues for solving the Web site wrapping task. Our solution to this novel challenge of Site-Wide wrapper induction consists of a sequence of steps: 1. classification of similar Web pages into classes, 2. discovery of these classes and 3. wrapper induction for each class. Our approach thus allows us to perform unsupervised information retrieval across an entire Web site. We test our algorithm against three real-world biochemical Deep Web sources and report our preliminary results, which are very promising.
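
The general idea of grouping pages into structural classes and then inducing one wrapper per class can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes that pages are compared by the Jaccard similarity of their tag-path sets, and that a text node repeated at the same path on every page of a class is a data label while the text following it is data. All function names, the 0.7 threshold, and the sample pages are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathTextParser(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def parse(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster_pages(pages, threshold=0.7):
    """Greedy clustering (an assumption of this sketch): a page joins the
    first class whose representative tag-path set is similar enough."""
    classes = []  # list of (representative_paths, member_indices)
    for index, items in enumerate(pages):
        paths = {path for path, _ in items}
        for representative, members in classes:
            if jaccard(paths, representative) >= threshold:
                members.append(index)
                break
        else:
            classes.append((paths, [index]))
    return [members for _, members in classes]


def induce_wrapper(class_pages):
    """Treat (path, text) nodes shared by every page of a class as labels."""
    labels = set(class_pages[0])
    for items in class_pages[1:]:
        labels &= set(items)
    return labels


def extract(items, labels):
    """Assign each non-label text node to the most recent label before it."""
    record, current = {}, None
    for item in items:
        if item in labels:
            current = item[1]
        elif current is not None:
            record.setdefault(current, []).append(item[1])
    return record


# Hypothetical sample site with two page classes.
site = [
    "<html><body><table><tr><th>Enzyme</th><td>Hexokinase</td></tr>"
    "<tr><th>EC Number</th><td>2.7.1.1</td></tr></table></body></html>",
    "<html><body><table><tr><th>Enzyme</th><td>Catalase</td></tr>"
    "<tr><th>EC Number</th><td>1.11.1.6</td></tr></table></body></html>",
    "<html><body><div><h1>Pathway</h1><p>Glycolysis</p></div></body></html>",
    "<html><body><div><h1>Pathway</h1><p>Citric acid cycle</p></div></body></html>",
]
parsed = [parse(html) for html in site]
classes = cluster_pages(parsed)
for members in classes:
    labels = induce_wrapper([parsed[i] for i in members])
    for i in members:
        print(i, extract(parsed[i], labels))
```

The sketch has clear limits: a class with only one page makes every text node look like a label, and a data value that happens to be identical on all pages of a class is also misread as a label, so a practical system would need more pages per class and additional cues.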