Extracting data records from web using suffix tree

  • Authors:
  • Xiaoqin Xie;Yixiang Fang;Zhiqiang Zhang;Li Li

  • Affiliations:
  • Harbin Engineering University, P. R. China;Harbin Institute of Technology, P. R. China;Harbin Engineering University, P. R. China;Harbin Engineering University, P. R. China

  • Venue:
  • Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics
  • Year:
  • 2012

Abstract

Many automatic methods can extract lists of objects from the Web, but they often fail to handle multi-type pages automatically. This paper introduces a new record extraction method based on a suffix tree, which can find repeated substrings. Our method first converts a distinct group of tag paths that appear repeatedly in the DOM tree of a Web document into a sequence of integers, and then builds a suffix tree from this sequence. Four refining filter rules are defined. After the refining process, the remaining patterns identify the useful data regions, from which data records can be extracted. Experiments on real data show that this method is applicable to various web pages and achieves higher accuracy and better robustness than previous methods.
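
The pipeline described in the abstract (tag paths, then an integer sequence, then repeated substrings) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It rests on several assumptions: the names `TagPathCollector`, `encode_paths`, `build_suffix_trie`, and `repeated_substrings` are hypothetical; every tag path is encoded rather than only the repeated ones; a naive, uncompressed suffix trie (quadratic in time and space) stands in for a real suffix tree; and the four refining filter rules are omitted because the abstract does not specify them.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Record the root-to-node tag path of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.paths = []   # one path string per start tag

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def encode_paths(paths):
    """Map each distinct tag path to an integer, in order of first appearance."""
    codes = {}
    sequence = [codes.setdefault(path, len(codes)) for path in paths]
    return sequence, codes


def build_suffix_trie(seq):
    """Insert every suffix of seq into a trie; each node counts the suffixes through it."""
    root = {"count": 0, "children": {}}
    for start in range(len(seq)):
        node = root
        for symbol in seq[start:]:
            node = node["children"].setdefault(symbol, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def repeated_substrings(root, min_len=2, min_count=2):
    """Return {substring: occurrences} for substrings seen at least min_count times.

    Occurrences may overlap; no filtering beyond length and count is applied.
    """
    found = {}
    stack = [(root, ())]
    while stack:
        node, prefix = stack.pop()
        for symbol, child in node["children"].items():
            if child["count"] >= min_count:
                substring = prefix + (symbol,)
                if len(substring) >= min_len:
                    found[substring] = child["count"]
                stack.append((child, substring))
    return found


# Hypothetical page fragment: three list items with the same structure.
html = (
    "<ul>"
    "<li><a>A</a><span>1</span></li>"
    "<li><a>B</a><span>2</span></li>"
    "<li><a>C</a><span>3</span></li>"
    "</ul>"
)
collector = TagPathCollector()
collector.feed(html)
sequence, codes = encode_paths(collector.paths)
patterns = repeated_substrings(build_suffix_trie(sequence))
print(sequence)
print(patterns[(1, 2, 3)])  # occurrences of the li, li/a, li/span pattern
```

In this toy input the pattern `(1, 2, 3)` corresponds to the paths `ul/li`, `ul/li/a`, `ul/li/span`, which is the kind of repeated region a record extractor would look for. The result also contains many overlapping or shifted candidates such as `(2, 3, 1)`, which suggests why a refinement stage is needed before patterns can be used for extraction.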