The discovery and extraction of general lists on the Web remains an important problem for the Web mining community. Numerous studies claim to automatically extract structured data (e.g., lists, record sets, and tables) from the Web for various purposes. Our own recent experience has shown that the list-finding methods used as part of these larger frameworks do not generalize well and therefore ought to be reevaluated. This paper briefly describes some of the current approaches and tests them on various list pages. Based on our findings, we conclude that analyzing a Web page's DOM structure is not sufficient for the general list-finding task.