Extracting general lists from web documents: a hybrid approach

Authors:
Fabio Fumarola;Tim Weninger;Rick Barber;Donato Malerba;Jiawei Han
Affiliations:
Dipartimento di Informatica, Università degli Studi di Bari "Aldo Moro", Bari, Italy;Computer Science Department, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL;Computer Science Department, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL;Dipartimento di Informatica, Università degli Studi di Bari "Aldo Moro", Bari, Italy;Computer Science Department, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL
Venue:
IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part I
Year:
2011

Abstract

The problem of extracting structured data (e.g., lists, record sets, and tables) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that methods which consider the HTML structure and the visual cues of a Web page independently do not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions about the visual rendering of lists and the structural representation of the items they contain. We show that our method significantly outperforms existing methods across a varied Web corpus.