Clustering visually similar web page elements for structured web data extraction

Authors:
Tomas Grigalis;Lukas Radvilavičius;Antanas Čenys;Juozas Gordevičius
Affiliations:
Vilnius Gediminas Technical University, Lithuania;Vilnius Gediminas Technical University, Lithuania;Vilnius Gediminas Technical University, Lithuania;Institute of Mathematics and Informatics, Vilnius University, Lithuania
Venue:
ICWE'12 Proceedings of the 12th international conference on Web Engineering
Year:
2012

References (7):

Visual Web Information Extraction with Lixto
Proceedings of the 27th International Conference on Very Large Data Bases

Testbed for information extraction from deep web
Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters

Fully automatic wrapper generation for search engines
WWW '05 Proceedings of the 14th international conference on World Wide Web

Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web

Extracting lists of data records from semi-structured web pages
Data & Knowledge Engineering

Extracting data records from the web using tag path clustering
Proceedings of the 18th international conference on World wide web

FiVaTech: Page-Level Web Data Extraction from Template Pages
IEEE Transactions on Knowledge and Data Engineering

Cited by (1):

Towards web-scale structured web data extraction
Proceedings of the sixth ACM international conference on Web search and data mining

Abstract

We propose ClustVX, a novel approach to structured web data extraction. It clusters visually similar web page elements by exploiting their visual formatting and structural features. The resulting clusters are then used to derive extraction rules. In an experimental evaluation on three publicly available benchmark data sets, the ClustVX system outperforms state-of-the-art structured data extraction systems.
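The general idea of grouping elements by visual and structural features can be illustrated with a minimal sketch. This is not the ClustVX algorithm itself, whose details the abstract does not give. It assumes each rendered element is described by a tag path plus a few style properties, and that elements sharing an identical combination belong to the same cluster. The `Element` class, its fields, and the functions `cluster_elements` and `repeated_clusters` are hypothetical names introduced for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A rendered page element (hypothetical, simplified representation)."""
    tag_path: str  # structural feature, e.g. "html/body/div/span"
    font: str      # visual features, as a browser might report them
    color: str
    text: str


def cluster_elements(elements):
    """Group element texts by an exact-match structural and visual key."""
    clusters = defaultdict(list)
    for el in elements:
        key = (el.tag_path, el.font, el.color)
        clusters[key].append(el.text)
    return dict(clusters)


def repeated_clusters(clusters, min_size=2):
    """Keep clusters with at least min_size members as candidate data fields."""
    return {k: v for k, v in clusters.items() if len(v) >= min_size}


# Toy page: two product records plus one unrelated footer element.
page = [
    Element("html/body/div/h2", "bold 16px", "#000", "Laptop A"),
    Element("html/body/div/span", "12px", "#c00", "$499"),
    Element("html/body/div/h2", "bold 16px", "#000", "Laptop B"),
    Element("html/body/div/span", "12px", "#c00", "$599"),
    Element("html/body/p", "10px", "#999", "Copyright notice"),
]

fields = repeated_clusters(cluster_elements(page))
```

In this toy page, the product names and prices form two repeated clusters, while the footer text appears only once and is filtered out. Each surviving cluster key could then serve as a simple extraction rule for one data field, under the assumptions stated above.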