Extracting lists of data records from semi-structured web pages

Authors:
Manuel Álvarez;Alberto Pan;Juan Raposo;Fernando Bellas;Fidel Cacheda
Affiliations:
Department of Information and Communication Technologies, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;Department of Information and Communication Technologies, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;Department of Information and Communication Technologies, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;Department of Information and Communication Technologies, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;Department of Information and Communication Technologies, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain
Venue:
Data & Knowledge Engineering
Year:
2008

Citing 28
Cited 15

New indices for text: PAT Trees and PAT arrays

Information retrieval
WebL - a programming language for the Web

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Generating finite-state transducers for semi-structured data extraction from the Web

Information Systems - Special issue on semistructured data
Focused crawling: a new approach to topic-specific Web resource discovery

WWW '99 Proceedings of the eighth international conference on World Wide Web
IEPAD: information extraction based on pattern discovery

Proceedings of the 10th international conference on World Wide Web
Building intelligent web applications using lightweight wrappers

Data & Knowledge Engineering - Special issue on heterogeneous information resources need semantic access
Foundations of Databases: The Logical Level

Foundations of Databases: The Logical Level
A brief survey of web data extraction tools

ACM SIGMOD Record
Hierarchical Wrapper Induction for Semistructured Information Sources

Autonomous Agents and Multi-Agent Systems
Mining the Web: Discovering Knowledge from HyperText Data

Mining the Web: Discovering Knowledge from HyperText Data
Crawling the Hidden Web

Proceedings of the 27th International Conference on Very Large Data Bases
Visual Web Information Extraction with Lixto

Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Semi-Automatic Wrapper Generation for Commercial Web Sources

Proceedings of the IFIP TC8 / WG8.1 Working Conference on Engineering Information Systems in the Internet Context
On the Automatic Extraction of Data from the Hidden Web

Revised Papers from the HUMACS, DASWIS, ECOMO, and DAMA on ER 2001 Workshops
Data extraction and label assignment for web databases

WWW '03 Proceedings of the 12th international conference on World Wide Web
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Probe, Cluster, and Discover: Focused Extraction of QA-Pagelets from the Deep Web

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Using the structure of Web sites for automatic segmentation of tables

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Thresher: automating the unwrapping of semantic content from the World Wide Web

WWW '05 Proceedings of the 14th international conference on World Wide Web
Clustering web pages based on their structure

Data & Knowledge Engineering - Special issue: WIDM 2003
HW-STALKER: a machine learning-based system for transforming QURE-Pagelets to XML

Data & Knowledge Engineering
Structured Data Extraction from the Web Based on Partial Tree Alignment

IEEE Transactions on Knowledge and Data Engineering
Automatically maintaining wrappers for semi-structured web sources

Data & Knowledge Engineering
Semantic deep web: automatic attribute extraction from the deep web data sources

Proceedings of the 2007 ACM symposium on Applied computing
Crawling the content hidden behind web forms

ICCSA'07 Proceedings of the 2007 international conference on Computational science and Its applications - Volume Part II
Extracting web data using instance-based learning

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Semistructured data: the TSIMMIS experience

ADBIS'97 Proceedings of the First East-European conference on Advances in Databases and Information systems

Detecting data records in semi-structured web sites based on text token clustering

Integrated Computer-Aided Engineering
Crawling Deep Web Using a New Set Covering Algorithm

ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
Information extraction for search engines using fast heuristic techniques

Data & Knowledge Engineering
Answering table augmentation queries from unstructured lists on the web

Proceedings of the VLDB Endowment
Providing resilient XPaths for external adaptation engines

Proceedings of the 21st ACM conference on Hypertext and hypermedia
Collective extraction from heterogeneous web lists

Proceedings of the fourth ACM international conference on Web search and data mining
Data extraction from web pages based on structural-semantic entropy

Proceedings of the 21st international conference companion on World Wide Web
Towards a method for unsupervised web information extraction

ICWE'12 Proceedings of the 12th international conference on Web Engineering
Clustering visually similar web page elements for structured web data extraction

ICWE'12 Proceedings of the 12th international conference on Web Engineering
TEX: An efficient and effective unsupervised Web information extractor

Knowledge-Based Systems
An unsupervised technique to extract information from semi-structured web pages

WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
Towards web-scale structured web data extraction

Proceedings of the sixth ACM international conference on Web search and data mining
SearchResultFinder: federated search made easy

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Leveraging spatial join for robust tuple extraction from web pages

Information Sciences: an International Journal
Selecting queries from sample to crawl deep web data sources

Web Intelligence and Agent Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Many web sources provide access to an underlying database containing structured data. These data can be usually accessed in HTML form only, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages while our method requires only one. In addition, previous methods make some assumptions about how data records are encoded into web pages, which do not always hold in real websites. Finally, we have also tested our techniques with a high number of real web sources and we have found them to be very effective.