Towards web-scale structured web data extraction

Authors:
Tomas Grigalis
Affiliations:
Vilnius Gediminas Technical University, Vilnius, Lithuania
Venue:
Proceedings of the sixth ACM international conference on Web search and data mining
Year:
2013

Abstract

This paper presents ongoing PhD research on unsupervised, domain-independent structured data extraction from the Web. We propose a novel method for extracting structured data records from template-generated Web pages. The method clusters visually similar Web page elements by exploiting their visual formatting and HTML structural features. The tag paths of the clustered elements are then used to derive extraction rules. These rules, called wrappers, can later be reused on thousands of Web pages generated from the same template, which opens the possibility of deploying the proposed method in Web-scale structured data extraction systems.
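
The pipeline described in the abstract (cluster similar elements, derive tag-path rules, reuse them on other pages) can be illustrated with a minimal sketch. This is not the author's implementation: the `Element` record, its visual fields (`font_size`, `font_weight`, `color`), the exact-match grouping used in place of a real similarity-based clustering, and the `min_size` threshold are all assumptions made for illustration. A real system would obtain the visual features from a rendering engine.

```python
from collections import defaultdict
from typing import NamedTuple


class Element(NamedTuple):
    """A rendered page element; every field is a hypothetical stand-in."""
    tag_path: str      # e.g. "html/body/ul/li/span"
    font_size: int     # rendered font size in pixels (assumed feature)
    font_weight: str   # "bold" or "normal" (assumed feature)
    color: str         # rendered text colour (assumed feature)
    text: str


def cluster_elements(elements):
    """Group elements that share a tag path and identical visual formatting.

    Exact matching is a simplification of "visually similar" clustering.
    """
    clusters = defaultdict(list)
    for el in elements:
        key = (el.tag_path, el.font_size, el.font_weight, el.color)
        clusters[key].append(el)
    return clusters


def derive_wrappers(clusters, min_size=2):
    """Return the tag paths of clusters that repeat at least min_size times."""
    return sorted({key[0] for key, members in clusters.items()
                   if len(members) >= min_size})


def apply_wrappers(wrappers, elements):
    """Extract text from another page using only the learned tag paths."""
    wanted = set(wrappers)
    return [el.text for el in elements if el.tag_path in wanted]


# Learn wrappers from one page, then reuse them on a page with the same template.
page_a = [
    Element("html/body/h1", 24, "bold", "black", "Shop"),
    Element("html/body/ul/li/b", 14, "bold", "black", "Lamp"),
    Element("html/body/ul/li/span", 12, "normal", "gray", "$10"),
    Element("html/body/ul/li/b", 14, "bold", "black", "Desk"),
    Element("html/body/ul/li/span", 12, "normal", "gray", "$90"),
]
page_b = [
    Element("html/body/h1", 24, "bold", "black", "Shop"),
    Element("html/body/ul/li/b", 14, "bold", "black", "Chair"),
    Element("html/body/ul/li/span", 12, "normal", "gray", "$45"),
]
wrappers = derive_wrappers(cluster_elements(page_a))
print(wrappers)
print(apply_wrappers(wrappers, page_b))
```

In this sketch the page heading occurs only once, so it forms no repeating cluster and yields no wrapper. The second page is processed using tag paths alone, without re-clustering, which is the property that makes wrapper reuse inexpensive across many pages of one template.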