Extract knowledge from semi-structured websites for search task simplification

  • Authors:
  • Yingqin Gu;Jun Yan;Hongyan Liu;Jun He;Lei Ji;Ning Liu;Zheng Chen

  • Affiliations:
  • Renmin University of China, Beijing, China;Microsoft Research Asia, Beijing, China;Tsinghua University, Beijing, China;Renmin University of China, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China

  • Venue:
  • Proceedings of the 20th ACM international conference on Information and knowledge management
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Simplifying the key tasks of search engine users by directly retrieving to them structured knowledge according to their queries is attracting much attention from both industry and academia. A bottleneck of this challenging problem is how to extract the structured knowledge from the noisy and complex Web scale websites automatically. In this paper, we propose an unsupervised automatic wrapper induction algorithm, named as Scalable Knowledge Extractor from webSites (SKES). SKES induces the wrapper in a divide and conquer mode, i.e., it divides the general wrapper into several sub-wrappers to learn from the data independently. Moreover, through employing techniques such as tag path representation of Web pages, SKES is verified to be efficient and noise-tolerant by the experimental results. Furthermore, based on our automatically extracted knowledge, we also built a prototype to serve structured knowledge to end users for simplifying their key search tasks. Very positive feedbacks were received on the prototype.