Strigil: A Framework for Data Extraction in Semi-Structured Web Documents

Authors:
Jakub Stárka;Irena Holubová;Martin Nečaský
Affiliations:
Department of Software Engineering, Charles University in Prague, Czech Republic;Department of Software Engineering, Charles University in Prague, Czech Republic;Department of Software Engineering, Charles University in Prague, Czech Republic
Venue:
Proceedings of International Conference on Information Integration and Web-based Applications & Services
Year:
2013

Abstract

In this paper we introduce Strigil, a framework for automated data extraction. It is an easily configurable tool for retrieving data from textual or weakly structured documents. The paper describes the framework architecture and its important components. Additionally, we propose a scraping language, inspired by XSL Transformations, designed to extract data from different kinds of documents. Although there are many approaches focused on various aspects of data scraping, they are usually highly specialized for a particular domain or data source. We compare these solutions and discuss their advantages and disadvantages. Our scraping language is designed to work with an ontology so that scraped data can be mapped directly to classes and attributes.
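
To make the idea of template-driven scraping with an ontology mapping more concrete, the following is a minimal sketch in Python. It is not Strigil's scraping language, whose syntax is not given in this record. The rule format, the `scrape` function, the `ex:` vocabulary, and the sample page are all hypothetical, and the input is assumed to be well-formed XML so that the standard library parser can read it. Each rule, loosely in the spirit of an XSLT template, matches a set of nodes, assigns them an ontology class, and maps sub-elements to attributes of that class.

```python
import xml.etree.ElementTree as ET

# Hypothetical template rules (illustrative only, not Strigil syntax).
# "match" selects nodes, "class" names an ontology class, and "attributes"
# maps ontology attributes to paths relative to each matched node.
TEMPLATES = [
    {
        "match": ".//div[@class='contract']",
        "class": "ex:Contract",
        "attributes": {
            "ex:title": "h2",
            "ex:price": "span[@class='price']",
        },
    },
]


def scrape(document, templates):
    """Apply each template and return (subject, predicate, object) triples."""
    root = ET.fromstring(document)  # assumes well-formed XML/XHTML input
    triples = []
    for template in templates:
        local_name = template["class"].split(":")[1].lower()
        for index, node in enumerate(root.findall(template["match"])):
            subject = f"_:{local_name}{index}"  # blank-node style identifier
            triples.append((subject, "rdf:type", template["class"]))
            for attribute, path in template["attributes"].items():
                value = node.findtext(path)
                if value is not None:  # skip attributes missing from the page
                    triples.append((subject, attribute, value.strip()))
    return triples


# Made-up sample page; the second record has no price element.
PAGE = """<html><body>
<div class="contract"><h2>Road repair</h2><span class="price">1200</span></div>
<div class="contract"><h2>Bridge survey</h2></div>
</body></html>"""

triples = scrape(PAGE, TEMPLATES)
for triple in triples:
    print(triple)
```

Keeping the rules as data separate from the engine mirrors the configurability the abstract emphasizes: supporting a new source would mean writing new rules rather than new code. Real web pages are rarely well-formed XML, so a practical version would need a tolerant HTML parser in front of this step.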