Simultaneous record detection and attribute labeling in web data extraction

  • Authors:
  • Jun Zhu;Zaiqing Nie;Ji-Rong Wen;Bo Zhang;Wei-Ying Ma

  • Affiliations:
  • Tsinghua University, Beijing, China;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Tsinghua University, Beijing, China;Microsoft Research Asia, Beijing, China

  • Venue:
  • Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Recent work has shown the feasibility and promise of template-independent Web data extraction. However, existing approaches use decoupled strategies - attempting to do data record detection and attribute labeling in two separate phases. In this paper, we show that separately extracting data records and attributes is highly ineffective and propose a probabilistic model to perform these two tasks simultaneously. In our approach, record detection can benefit from the availability of semantics required in attribute labeling and, at the same time, the accuracy of attribute labeling can be improved when data records are labeled in a collective manner. The proposed model is called Hierarchical Conditional Random Fields. It can efficiently integrate all useful features by learning their importance, and it can also incorporate hierarchical interactions which are very important for Web data extraction. We empirically compare the proposed model with existing decoupled approaches for product information extraction, and the results show significant improvements in both record detection and attribute labeling.