Simplifying the key tasks of search engine users by directly returning structured knowledge that matches their query intents has attracted much attention from both industry and academia. The challenge lies in automatically extracting structured knowledge from noisy, complex websites at web scale. Although various automatic wrapper induction algorithms have been proposed, many of them are too ineffective or inefficient for web-scale applications. In this paper, we propose an unsupervised automatic wrapper induction algorithm, named SKES, to efficiently extract knowledge from semi-structured websites. SKES induces the wrapper in a divide-and-conquer manner: it divides the general wrapper into sub-wrappers that can learn from data independently, which makes it efficient and easy to parallelize. Moreover, by employing techniques such as a tag path representation of web pages, SKES dramatically reduces the number of tags and naturally differentiates their roles. The proposed solution was applied to and evaluated on a large number of real websites and compared with two existing methods that are most closely related to it. It is much more efficient than the existing methods and provides high extraction accuracy. We extracted 2.5 million entities and 29 million data fields from over 10 thousand high-traffic websites, which demonstrates the applicability of the method. Furthermore, based on the automatically extracted data, we built a prototype that serves structured knowledge to simplify the key search tasks of end users. The feedback received for the prototype was highly positive.
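To make the tag path idea concrete, the following is a minimal sketch and not the SKES algorithm itself. It assumes a tag path is the chain of open tags from the document root down to a text node, and it groups text nodes by that path. The names `TagPathCollector`, `tag_paths`, and `guess_roles` are hypothetical, and the rule that a path with constant text is a template label while a path with varying text is a data field is an illustrative assumption rather than something stated above.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Groups the text nodes of a page by their root-to-node tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags, root first
        self.paths = defaultdict(list)   # tag path -> texts found under it

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths["/".join(self.stack)].append(text)


def tag_paths(html):
    """Return a dict mapping each tag path to the list of texts under it."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.paths)


def guess_roles(groups):
    """Assumed heuristic: repeated identical text is a label, varying text is data."""
    roles = {}
    for path, texts in groups.items():
        if len(texts) > 1 and len(set(texts)) == 1:
            roles[path] = "label"
        else:
            roles[path] = "data"
    return roles


page = (
    "<html><body><ul>"
    "<li><b>Price</b><span>10</span></li>"
    "<li><b>Price</b><span>12</span></li>"
    "</ul></body></html>"
)
groups = tag_paths(page)
roles = guess_roles(groups)
```

In this toy page the many individual tags collapse into two distinct tag paths, and each path can be processed independently of the other, which is the property that makes per-path (sub-wrapper) learning easy to run in parallel.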