A hierarchical approach to wrapper induction
Proceedings of the third annual conference on Autonomous Agents
Learning Information Extraction Rules for Semi-Structured and Free Text
Machine Learning - Special issue on natural language learning
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
SemTag and seeker: bootstrapping the semantic web via automated semantic annotation
WWW '03 Proceedings of the 12th international conference on World Wide Web
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Wrapper induction for information extraction
Wrapper induction for information extraction
Mining data records in Web pages
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web
2D Conditional Random Fields for Web information extraction
ICML '05 Proceedings of the 22nd international conference on Machine learning
Structure-driven crawler generation by example
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Simultaneous record detection and attribute labeling in web data extraction
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Joint optimization of wrapper generation and template detection
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Bootstrapping Information Extraction from Semi-structured Web Pages
ECML PKDD '08 Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I
Incorporating site-level knowledge to extract structured data from web forums
Proceedings of the 18th international conference on World wide web
Mining employment market via text block detection and adaptive cross-domain information extraction
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
IEEE Transactions on Knowledge and Data Engineering
Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics
Robust web data extraction: a novel approach based on minimum cost script edit model
WISM'12 Proceedings of the 2012 international conference on Web Information Systems and Mining
Unsupervised wrapper induction using linked data
Proceedings of the seventh international conference on Knowledge capture
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model
ACM Transactions on the Web (TWEB)
Extraction and integration of partially overlapping web sources
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
Structured data, in the form of entities and associated attributes, has been a rich web resource for search engines and knowledge databases. To efficiently extract structured data from enormous websites in various verticals (e.g., books, restaurants), much research effort has been attracted, but most existing approaches either require considerable human effort or rely on strong features that lack of flexibility. We consider an ambitious scenario -- can we build a system that (1) is general enough to handle any vertical without re-implementation and (2) requires only one labeled example site from each vertical for training to automatically deal with other sites in the same vertical? In this paper, we propose a unified solution to demonstrate the feasibility of this scenario. Specifically, we design a set of weak but general features to characterize vertical knowledge (including attribute-specific semantics and inter-attribute layout relationships). Such features can be adopted in various verticals without redesign; meanwhile, they are weak enough to avoid overfitting of the learnt knowledge to seed sites. Given a new unseen site, the learnt knowledge is first applied to identify page-level candidate attribute values, while inevitably involve false positives. To remove noise, site-level information of the new site is then exploited to boost up the true values. The site-level information is derived in an unsupervised manner, without harm to the applicability of the solution. Promising experimental performance on 80 websites in 8 distinct verticals demonstrated the feasibility and flexibility of the proposed solution.