Mining data records in Web pages
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Fully automatic wrapper generation for search engines
WWW '05 Proceedings of the 14th international conference on World Wide Web
IEEE Transactions on Knowledge and Data Engineering
Structured Data Extraction from the Web Based on Partial Tree Alignment
IEEE Transactions on Knowledge and Data Engineering
ViDE: A Vision-Based Approach for Deep Web Data Extraction
IEEE Transactions on Knowledge and Data Engineering
SXPath: extending XPath towards spatial querying on web documents
Proceedings of the VLDB Endowment
Feature-based object identification for web automation
Proceedings of the 28th Annual ACM Symposium on Applied Computing
Deep Web pages carry information relevant to many application domains, such as e-government, e-commerce, and social networking. For this reason, there is sustained interest in extracting data from Deep Web sources efficiently, effectively, and automatically. In this paper we present SILA, a novel Spatial Instance Learning Approach that extracts data records from Deep Web pages by exploiting both the spatial arrangement and the presentation features of data items/fields, as produced by the layout engines of Web browsers when rendering Deep Web pages on screen. SILA is independent of the internal HTML encoding of Web pages, and it recognizes data records in pages that have multiple data regions whose data items are arranged in many different presentation layouts. Experimental results show that SILA achieves very high precision and recall and that it performs considerably better than the MDR and ViNTs approaches.