Exploiting attribute redundancy for web entity data extraction

  • Authors:
  • Yanxu Zhu;Gang Yin;Xiang Li;Huaimin Wang;Dianxi Shi;Lin Yuan

  • Affiliations:
  • College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology and National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, Hunan, China;College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China;College of Electronic Technology, Information Engineering University, Zhengzhou, Henan, China

  • Venue:
  • ICADL'11 Proceedings of the 13th international conference on Asia-pacific digital libraries: for cultural heritage, knowledge dissemination, and future creation
  • Year:
  • 2011

Abstract

Web entities are often associated with many attributes that describe them, and extracting these attributes is essential to Web entity data extraction. This paper proposes a novel approach that exploits duplicated attribute-value pairs. First, we construct an initial seed set of attributes, including names and enumerable values, and a training set of Web pages from the target website. Second, we locate the position of each attribute by matching attribute values within the pages of the site against the values contained in the seed set. Third, we choose the position with the highest supportiveness as the extraction path, which we use to extract other attribute-value pairs from pages that share the same template. Finally, we conduct an extensive experimental study on a large real data set to demonstrate the effectiveness of our extraction approach.
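The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define how "supportiveness" is computed, so the sketch assumes it is simply the number of training-page text nodes at a given position whose text equals a seed value. The position encoding (an indexed tag path such as `html[1]/body[1]/table[1]/tr[1]/td[2]`) and the names `PathTextParser`, `text_nodes`, `learn_paths`, and `extract` are likewise hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class PathTextParser(HTMLParser):
    """Collect (path, text) pairs for every non-empty text node.

    Assumes reasonably well-formed HTML in which every opened tag is
    closed; unclosed void tags such as <br> are not handled in this sketch.
    """

    def __init__(self):
        super().__init__()
        self.stack = []                  # open elements, e.g. ["html[1]", "body[1]"]
        self.child_counts = [Counter()]  # per open element: child tags seen so far
        self.nodes = []                  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        counts[tag] += 1
        self.stack.append(f"{tag}[{counts[tag]}]")
        self.child_counts.append(Counter())

    def handle_endtag(self, tag):
        # Pop only when the end tag matches the innermost open element.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    """Return the (path, text) pairs of one page."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_paths(seed, training_pages):
    """Pick, per attribute, the path whose text most often matches a seed value.

    seed maps an attribute name to a set of known (enumerable) values.
    Support is ASSUMED here to be a plain match count.
    """
    support = defaultdict(Counter)
    for html in training_pages:
        for path, text in text_nodes(html):
            for attr, values in seed.items():
                if text in values:
                    support[attr][path] += 1
    return {attr: counts.most_common(1)[0][0] for attr, counts in support.items()}


def extract(paths, html):
    """Read the text found at each learned path on a page with the same template."""
    by_path = {}
    for path, text in text_nodes(html):
        by_path.setdefault(path, text)  # keep the first text seen at a path
    return {attr: by_path[path] for attr, path in paths.items() if path in by_path}
```

Under these assumptions, `learn_paths` would be called once with the seed set and the training pages of a site, and `extract` would then be applied to each remaining page generated from the same template. Values absent from the seed set can still be extracted, because only the learned position is used at extraction time.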