Data integration using similarity joins and a word-based information representation language
ACM Transactions on Information Systems (TOIS)
Automatic segmentation of text into structured records
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining reference tables for automatic text segmentation
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
A hierarchical graphical model for record linkage
UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence
Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web
Integrating Unstructured Data into Relational Databases
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Simultaneous record detection and attribute labeling in web data extraction
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
A Survey of Web Information Extraction Systems
IEEE Transactions on Knowledge and Data Engineering
Building structured web community portals: a top-down, compositional, and incremental approach
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Context-aware wrapping: synchronized data extraction
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Extracting lists of data records from semi-structured web pages
Data & Knowledge Engineering
Answering table augmentation queries from unstructured lists on the web
Proceedings of the VLDB Endowment
Harvesting relational tables from lists on the web
Proceedings of the VLDB Endowment
Exploiting content redundancy for web information extraction
Proceedings of the VLDB Endowment
Highly efficient algorithms for structural clustering of large websites
Proceedings of the 20th international conference on World wide web
An analysis of structured data on the web
Proceedings of the VLDB Endowment
TEX: An efficient and effective unsupervised Web information extractor
Knowledge-Based Systems
Knowledge harvesting in the big-data era
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Hi-index | 0.00 |
Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites. We present a novel solution to this extraction problem that is based on i) collective extraction from multiple lists simultaneously and ii) careful exploitation of a small database of seed entities. Our approach addresses the layout homogeneity within the individual lists, content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real world data sets and show that our approach is a promising direction for extraction from noisy lists, requiring mild and thus inexpensive supervision suitable for extraction from the tail of the web.