Collective extraction from heterogeneous web lists

Authors:
Ashwin Machanavajjhala;Arun Shankar Iyer;Philip Bohannon;Srujana Merugu
Affiliations:
Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Bangalore, India;Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA
Venue:
Proceedings of the fourth ACM international conference on Web search and data mining
Year:
2011

Abstract

Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites. We present a novel solution to this extraction problem that is based on (i) collective extraction from multiple lists simultaneously and (ii) careful exploitation of a small database of seed entities. Our approach addresses the layout homogeneity within the individual lists, content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real-world data sets and show that our approach is a promising direction for extraction from noisy lists, requiring mild and thus inexpensive supervision suitable for extraction from the tail of the web.