An unsupervised technique to extract information from semi-structured web pages

Authors:
Hassan A. Sleiman; Rafael Corchuelo
Affiliations:
University of Sevilla, Spain; University of Sevilla, Spain
Venue:
WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
Year:
2012

Abstract

We propose a technique that takes two or more web pages generated by the same server-side template and attempts to learn a regular expression that models that template and can then be used to extract the relevant information from similar pages. Our experimental results on real-world web sites demonstrate that our technique outperforms others in terms of both effectiveness and efficiency, and that it is not affected by HTML errors.
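
To make the idea of learning a regular expression from pages that share a template more concrete, the following is a minimal sketch. It is not the authors' algorithm: it assumes that a generic sequence alignment (Python's `difflib.SequenceMatcher`) is an acceptable stand-in for template detection, and the names `tokenize`, `learn_pattern`, and `extract` are hypothetical. Tokens shared by both sample pages are treated as template text; regions where the pages differ become capture groups.

```python
import re
from difflib import SequenceMatcher

# A token is either one HTML tag or one run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split a page into tag and text tokens, dropping whitespace-only text."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def learn_pattern(page_a, page_b):
    """Generalize two sample pages into one regular expression.

    Assumption for this sketch: tokens aligned as equal belong to the
    template, and any differing region holds data to extract.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            # Shared tokens are matched literally.
            parts.extend(re.escape(tok) for tok in a[i1:i2])
        else:
            # Differing tokens collapse into one lazy capture group.
            parts.append("(.*?)")
    # Allow optional whitespace between consecutive tokens.
    return re.compile(r"\s*".join(parts), re.DOTALL)


def extract(pattern, page):
    """Return the captured fields of a page, or None if it does not match."""
    match = pattern.search(page)
    return match.groups() if match else None


if __name__ == "__main__":
    sample_a = "<html><body><h1>Dune</h1><p>Price: <b>9.99</b></p></body></html>"
    sample_b = "<html><body><h1>Emma</h1><p>Price: <b>4.50</b></p></body></html>"
    unseen = "<html><body><h1>Ulysses</h1><p>Price: <b>12.00</b></p></body></html>"
    print(extract(learn_pattern(sample_a, sample_b), unseen))
```

The sketch handles only two well-formed sample pages and flat, non-repeating fields. Repeated records, optional sections, and malformed HTML, which the abstract says the technique tolerates, are outside its scope.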