Exploiting attribute redundancy for web entity data extraction

Authors:
Yanxu Zhu;Gang Yin;Xiang Li;Huaimin Wang;Dianxi Shi;Lin Yuan
Affiliations:
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology and National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Electronic Technology, Information Engineering University, Zhengzhou, Henan, China
Venue:
ICADL'11 Proceedings of the 13th international conference on Asia-pacific digital libraries: for cultural heritage, knowledge dissemination, and future creation
Year:
2011

Abstract

Web entities are often associated with many attributes that describe them, and extracting these attributes is essential to Web entity data extraction. This paper proposes a novel approach that exploits duplicated attribute-value pairs. First, we construct an initial seed set of attributes, including their names and enumerable values, and a training set of Web pages from the target website. Second, we locate the position of each attribute by matching attribute values within the pages of the site against the values contained in the seed set. Third, we choose the position with the highest supportiveness as the extraction path, which we use to extract other attribute-value pairs from pages that share the same template. Finally, we conduct an extensive experimental study on a large real data set to demonstrate the effectiveness of our extraction approach.
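
The steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not define how positions or supportiveness are computed, so the following are assumptions made only for illustration: a "position" is the chain of tag names (with class names) leading to a text node, a match is an exact case-insensitive comparison against lowercase seed values, and "supportiveness" is the number of training pages in which a position holds a seed value. The names `PathTextParser`, `text_nodes`, `learn_paths`, and `extract`, as well as the sample pages, are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag; they are kept off the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathTextParser(HTMLParser):
    """Collects (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.nodes = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_paths(seed, training_pages):
    """For each attribute, pick the path that holds a seed value on the most pages.

    Assumption: support = number of training pages where the path matched.
    """
    support = {name: Counter() for name in seed}
    for html in training_pages:
        nodes = text_nodes(html)
        for name, values in seed.items():
            matched = {path for path, text in nodes if text.lower() in values}
            support[name].update(matched)  # each path counted once per page
    return {name: c.most_common(1)[0][0] for name, c in support.items() if c}


def extract(paths, html):
    """Read the first text node found at each learned path."""
    record = {}
    for path, text in text_nodes(html):
        for name, wanted in paths.items():
            if path == wanted and name not in record:
                record[name] = text
    return record


# Hypothetical seed set and training pages that share one template.
seed = {"color": {"red", "blue"}}
pages = [
    "<html><body><div class='title'>Red Phone</div>"
    "<ul><li class='color'>Red</li></ul></body></html>",
    "<html><body><p class='note'>Blue</p>"
    "<ul><li class='color'>Blue</li></ul></body></html>",
]
paths = learn_paths(seed, pages)
new_page = "<html><body><ul><li class='color'>Green</li></ul></body></html>"
record = extract(paths, new_page)
```

In this toy data the value "Blue" also appears in a stray note on one page, but the list-item position is matched on both training pages, so it wins. The learned path then yields "Green" from the new page even though that value is not in the seed set.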