HyLiEn: a hybrid approach to general list extraction on the web

Authors:
Fabio Fumarola;Tim Weninger;Rick Barber;Donato Malerba;Jiawei Han
Affiliations:
Università degli Studi di Bari, Bari, Italy;University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA;University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA;Università degli Studi di Bari, Bari, Italy;University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
Venue:
Proceedings of the 20th international conference companion on World wide web
Year:
2011

Abstract

We consider the problem of automatically extracting general lists from the Web. Existing approaches mostly depend on either the underlying HTML markup or the visual structure of the Web page. We present HyLiEn, an unsupervised, Hybrid approach for automatic List discovery and Extraction on the Web. It employs general assumptions about the visual rendering of lists and the structural representation of the items they contain. We show that our method significantly outperforms existing methods.