Data extraction and label assignment for web databases

Authors:
Jiying Wang;Fred H. Lochovsky
Affiliations:
University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong;University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Venue:
WWW '03 Proceedings of the 12th international conference on World Wide Web
Year:
2003


Abstract

Many tools have been developed to help users query, extract, and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is to obtain the schema of the attributes of the retrieved data. In this paper, we describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute for wrapper induction for each site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment).
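
As a rough illustration of what a regular expression wrapper does once it exists, the Python sketch below applies a wrapper to a toy result page and stores the extracted data objects in a labelled table. It is not DeLa's implementation: the page markup, the `WRAPPER` pattern, the `LABELS` tuple, and the `extract_table` function are all hypothetical, and both the wrapper and the labels are written by hand here, whereas the abstract describes inducing the wrapper and assigning the labels automatically.

```python
import re

# Toy result page. The markup is invented for illustration only.
RESULT_PAGE = """
<html><body>
<div class="hit"><b>Database Systems</b> by <i>A. Author</i> <span>$45.00</span></div>
<div class="hit"><b>Web Mining</b> by <i>B. Writer</i> <span>$30.50</span></div>
</body></html>
"""

# Hand-written wrapper (an assumption: DeLa would induce this automatically).
# The HTML tags act as delimiters and each group captures one data value.
WRAPPER = re.compile(
    r'<div class="hit"><b>(.*?)</b> by <i>(.*?)</i> <span>(.*?)</span></div>'
)

# Column labels supplied by hand (an assumption: DeLa assigns labels itself).
LABELS = ("title", "author", "price")


def extract_table(page, wrapper=WRAPPER, labels=LABELS):
    """Apply the wrapper to a page and return one labelled row per data object."""
    return [dict(zip(labels, match.groups())) for match in wrapper.finditer(page)]


if __name__ == "__main__":
    for row in extract_table(RESULT_PAGE):
        print(row)
```

Each match of the wrapper corresponds to one data object on the result page, and each capture group corresponds to one column of the reconstructed table.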