Data extraction from web pages based on structural-semantic entropy

Authors:
Xiaoqing Zheng;Yiling Gu;Yinsheng Li
Affiliations:
Fudan University, Shanghai, China;Fudan University, Shanghai, China;Fudan University, Shanghai, China
Venue:
Proceedings of the 21st international conference companion on World Wide Web
Year:
2012

Citing 21
Cited 3

A linear programming framework for logics of uncertainty

Decision Support Systems
Grammars have exceptions

Information Systems - Special issue on semistructured data
Focused crawling: a new approach to topic-specific Web resource discovery

WWW '99 Proceedings of the eighth international conference on World Wide Web
Conceptual-model-based data extraction from multiple-record Web pages

Data & Knowledge Engineering
IEPAD: information extraction based on pattern discovery

Proceedings of the 10th international conference on World Wide Web
A flexible learning system for wrapping tables and lists in HTML documents

Proceedings of the 11th international conference on World Wide Web
A brief survey of web data extraction tools

ACM SIGMOD Record
Using Grammatical Inference to Automate Information Extraction from the Web

PKDD '01 Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Automatic Web Information Extraction in the ROADRUNNER System

Revised Papers from the HUMACS, DASWIS, ECOMO, and DAMA on ER 2001 Workshops
Data extraction and label assignment for web databases

WWW '03 Proceedings of the 12th international conference on World Wide Web
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
OLERA: Semisupervised Web-Data Extraction with Visual Support

IEEE Intelligent Systems
A Survey of Web Information Extraction Systems

IEEE Transactions on Knowledge and Data Engineering
Structured Data Extraction from the Web Based on Partial Tree Alignment

IEEE Transactions on Knowledge and Data Engineering
Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge

World Wide Web
Extracting lists of data records from semi-structured web pages

Data & Knowledge Engineering
An unsupervised method for joint information extraction and feature mining across different Web sites

Data & Knowledge Engineering
Semi-supervised learning of attribute-value pairs from product descriptions

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Unsupervised named-entity extraction from the Web: An experimental study

Artificial Intelligence
Semistructured data: the TSIMMIS experience

ADBIS'97 Proceedings of the First East-European conference on Advances in Databases and Information systems

Mining taxonomies from web menus: rule-based concepts and algorithms

ICWE'13 Proceedings of the 13th international conference on Web Engineering
Strigil: A Framework for Data Extraction in Semi-Structured Web Documents

Proceedings of International Conference on Information Integration and Web-based Applications & Services
Linkage of compound objects for supporting maintenance of large-scale web sites

Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication

Quantified Score

Hi-index	0.00

Visualization

Abstract

Most of today's web content is designed for human consumption, which makes it difficult for software tools to access them readily. Even web content that is automatically generated from back-end databases is usually presented without the original structural information. In this paper, we present an automated information extraction algorithm that can extract the relevant attribute-value pairs from product descriptions across different sites. A notion, called structural-semantic entropy, is used to locate the data of interest on web pages, which measures the density of occurrence of relevant information on the DOM tree representation of web pages. Our approach is less labor-intensive and insensitive to changes in web-page format. Experimental results on a large number of real-life web page collections are encouraging and confirm the feasibility of the approach, which has been successfully applied to detect false drug advertisements on the web due to its capacity in associating the attributes of records with their respective values.