Unexpected results in automatic list extraction on the web

Authors:
Tim Weninger;Fabio Fumarola;Rick Barber;Jiawei Han;Donato Malerba
Affiliations:
University of Illinois at Urbana-Champaign;Università degli Studi di Bari "Aldo Moro";University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;Università degli Studi di Bari "Aldo Moro"
Venue:
ACM SIGKDD Explorations Newsletter
Year:
2011

Citing 9
Cited 5

RoadRunner: automatic data extraction from data-intensive web sites

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Towards domain-independent information extraction from web tables

Proceedings of the 16th international conference on World Wide Web
Flint: Google-basing the Web

EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Language-Independent Set Expansion of Named Entities Using the Web

ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
WebTables: exploring the power of tables on the web

Proceedings of the VLDB Endowment
Extracting data records from the web using tag path clustering

Proceedings of the 18th international conference on World wide web
Answering table augmentation queries from unstructured lists on the web

Proceedings of the VLDB Endowment
Extracting content structure for web pages based on visual representation

APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications

HyLiEn: a hybrid approach to general list extraction on the web

Proceedings of the 20th international conference companion on World wide web
WINACS: construction and analysis of web-based computer science information networks

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Extracting general lists from web documents: a hybrid approach

IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part I
A system for extracting top-K lists from the web

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
The parallel path framework for entity discovery on the web

ACM Transactions on the Web (TWEB)

Abstract

The discovery and extraction of general lists on the Web continues to be an important problem facing the Web mining community. There have been numerous studies that claim to automatically extract structured data (i.e. lists, record sets, tables, etc.) from the Web for various purposes. Our own recent experiences have shown that the list-finding methods used as part of these larger frameworks do not generalize well and therefore ought to be reevaluated. This paper briefly describes some of the current approaches, and tests them on various list-pages. Based on our findings, we conclude that analyzing a Web page's DOM-structure is not sufficient for the general list finding task.