A dynamic learning framework to thoroughly extract structured data from web pages without human efforts

Authors:
Dandan Song;Yunpeng Wu;Lejian Liao;Long Li;Fei Sun
Affiliations:
Beijing Institute of Technology, Beijing, China;Beijing Institute of Technology, Beijing, China;Beijing Institute of Technology, Beijing, China;Beijing Institute of Technology, Beijing, China;Beijing Institute of Technology, Beijing, China
Venue:
Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics
Year:
2012

Abstract

The structured data in web pages contains a vast amount of concrete and comprehensive information. Entity attributes and their corresponding values are valuable resources for automatic semantic annotation, knowledge discovery, and information utilization. However, the variety of display styles and formats across web pages makes extracting them a challenging task. We observe that, although a single page carries little information, different web pages and different web sites describing similar entities together provide enough knowledge for a computer to learn from. This paper presents a dynamic learning framework that effectively extracts structured information from a large number of websites in various verticals (e.g., books, cameras, jobs). Unlike existing approaches, which are static, require manually labeled samples, and cannot adapt to unseen attributes, our approach aims to extract structured data from web pages dynamically, automatically, and thoroughly. Experiments on a total of 17,850 web pages in 4 verticals demonstrate the effectiveness of our framework.