Scalable and noise tolerant web knowledge extraction for search task simplification

Authors:
Jun He;Yingqin Gu;Hongyan Liu;Jun Yan;Hong Chen
Affiliations:
Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, China;Research Center for Contemporary Management, Tsinghua University, China and Department of Management Science and Engineering, Tsinghua University, China;Microsoft Research Asia, Beijing, China;Key Labs of Data Engineering and Knowledge Engineering, Ministry of Education, China and School of Information, Renmin University of China, China
Venue:
Decision Support Systems
Year:
2013

Abstract

Simplifying the key tasks of search engine users by directly returning structured knowledge that matches their query intent has attracted much attention from both industry and academia. The challenge lies in automatically extracting structured knowledge from noisy, complex websites at web scale. Although various automatic wrapper induction algorithms have been proposed, many of them are too ineffective or too inefficient for web-scale applications. In this paper, we propose an unsupervised automatic wrapper induction algorithm, named SKES, to efficiently extract knowledge from semi-structured websites. SKES induces the wrapper in a divide-and-conquer manner: it divides the general wrapper into sub-wrappers that can learn from data independently, which makes it efficient and easy to implement in parallel. Moreover, by employing techniques such as a tag path representation of web pages, SKES can dramatically reduce the number of tags and naturally differentiate their roles. The proposed solution was applied to and evaluated on a large number of real websites and compared with two existing methods that are most closely related to it. The proposed method is much more efficient than the existing methods and provides high extraction accuracy. We have extracted 2.5 million entities and 29 million data fields from over 10 thousand high-traffic websites, which demonstrates the applicability of the method. Furthermore, based on the automatically extracted data, we built a prototype that serves structured knowledge to simplify the key search tasks of end users. The feedback received for the prototype was highly positive.
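
The tag path representation mentioned in the abstract can be illustrated with a small sketch. This is not the SKES algorithm, and the abstract does not say how SKES builds or uses its paths. The sketch only shows the general idea of describing each text node by the sequence of tags leading to it, so that repeated fields collapse onto a few shared paths. The names `TagPathCollector` and `tag_paths`, the `VOID_TAGS` set, and the sample page are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: a small, incomplete set of void elements. They never get a
# closing tag, so they must not be kept on the stack of open tags.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Groups text nodes by the root-to-leaf tag path that leads to them."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # tags that are currently open
        self.paths = defaultdict(list)   # tag path -> text values seen under it

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths["/".join(self.stack)].append(text)


def tag_paths(html):
    """Return a dict mapping each tag path to the list of texts found there."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.paths)


# Hypothetical page with two records that share the same layout.
page = (
    "<html><body><ul>"
    "<li><b>Title A</b><span>10</span></li>"
    "<li><b>Title B</b><span>20</span></li>"
    "</ul></body></html>"
)

for path, values in tag_paths(page).items():
    print(path, values)
```

In this toy page the four text nodes fall under only two distinct paths, one per field, which hints at why a path-level view can be far more compact than a tag-level one and why different paths tend to correspond to different roles.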