Finding and Extracting Data Records from Web Pages

Authors:
Manuel Álvarez;Alberto Pan;Juan Raposo;Fernando Bellas;Fidel Cacheda
Affiliations:
Department of Information and Communications Technologies, University of A Coruña, A Coruña, Spain (all authors)
Venue:
Journal of Signal Processing Systems
Year:
2010

Abstract

Many HTML pages are generated by programs that query an underlying database and fill in a template with the results. In this process the metainformation about the structure of the data is lost, so automated programs cannot process the data as powerfully as they could process the original database records. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method needs only a single input page. It starts by identifying the data region of interest in the page. It then partitions that region into records by using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, it extracts the attributes of the data records by using a method based on multiple string alignment. We have tested our techniques on a large number of real web sources, obtaining high precision and recall values.
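
The record-partitioning idea in the abstract (grouping similar subtrees of the DOM tree) can be illustrated with a minimal sketch. This is not the authors' algorithm: the similarity measure (difflib's sequence ratio over tag names), the greedy clustering, the 0.8 threshold, the sample page, and all function names are assumptions made for illustration. The last step simply lists the text of each record and stands in for, but does not implement, multiple string alignment.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def tag_sequence(node):
    """Pre-order list of tag names in the subtree rooted at node."""
    return [el.tag for el in node.iter()]


def similarity(a, b):
    """Ratio in [0, 1] between two tag sequences (difflib's measure, an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_children(region, threshold=0.8):
    """Greedily group the child subtrees of `region` by structural similarity."""
    clusters = []  # each cluster is a list of child elements
    for child in region:
        seq = tag_sequence(child)
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if similarity(seq, tag_sequence(cluster[0])) >= threshold:
                cluster.append(child)
                break
        else:
            clusters.append([child])
    return clusters


# Hypothetical, well-formed sample page; the outer <div> plays the data region.
PAGE = """
<div>
  <h1>Results</h1>
  <div><b>Book A</b><span>10 EUR</span></div>
  <div><b>Book B</b><span>12 EUR</span></div>
  <div><b>Book C</b><span>9 EUR</span><i>used</i></div>
  <p>Footer</p>
</div>
"""

region = ET.fromstring(PAGE)
clusters = cluster_children(region)

# Assume the largest cluster holds the data records.
records = max(clusters, key=len)

# Naive attribute listing: the non-empty text of each node in a record.
rows = [
    [el.text for el in rec.iter() if el.text and el.text.strip()]
    for rec in records
]

for row in rows:
    print(row)
```

In this sketch the third record carries an extra optional field, yet it still joins the same cluster because its tag sequence is close enough to the others. Handling such optional fields consistently across records is the kind of problem the alignment step described in the abstract addresses.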