A dynamic learning framework to thoroughly extract structured data from web pages without human efforts

  • Authors:
  • Dandan Song; Yunpeng Wu; Lejian Liao; Long Li; Fei Sun

  • Affiliations:
  • Beijing Institute of Technology, Beijing, China (all authors)

  • Venue:
  • Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics
  • Year:
  • 2012

Abstract

A tremendous amount of concrete and comprehensive information is contained in the structured data of web pages. The attributes of entities and their corresponding values are precious resources for automatic semantic annotation, knowledge discovery, and information utilization. However, the varied display styles and formats of web pages make extracting them a challenging task. We observe that, although a single page carries too little information, different web pages and different web sites describing similar entities together provide adequate knowledge for computers to learn from. This paper presents a dynamic learning framework to effectively extract structured information from a large number of websites in various verticals (e.g., books, cameras, jobs). Unlike existing approaches, which are static, require manually labeled samples, and cannot adapt to unseen attributes, our approach aims to extract structured data from web pages dynamically, automatically, and thoroughly. Experiments on a total of 17,850 web pages in 4 verticals demonstrate the effectiveness of our framework.