Extracting data records from web using suffix tree

Authors:
Xiaoqin Xie;Yixiang Fang;Zhiqiang Zhang;Li Li
Affiliations:
Harbin Engineering University, P. R. China;Harbin Institute of Technology, P. R. China;Harbin Engineering University, P. R. China;Harbin Engineering University, P. R. China
Venue:
Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics
Year:
2012

Abstract

Many automatic methods can extract lists of objects from the Web, but they often fail to handle multi-type pages automatically. This paper introduces a new method for record extraction based on a suffix tree, which can find repeated substrings. Our method first maps the distinct tag paths that appear repeatedly in the DOM tree of a Web document to a sequence of integers, and then builds a suffix tree over this sequence. Four filter rules are defined to refine the candidate patterns. After refinement, the remaining data-region patterns can be used to extract data records. Experiments on real data show that the method is applicable to a variety of Web pages and achieves higher accuracy and better robustness than previous methods.
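
The first two steps described in the abstract (encoding tag paths as integers, then finding repeated substrings) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes every start tag contributes one root-to-node tag path, it uses a naive uncompressed suffix trie in place of a true suffix tree, and it omits the four filter rules, which the abstract does not specify. All names (`TagPathCollector`, `encode`, `build_suffix_trie`, `repeated_substrings`) are hypothetical.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Record the root-to-node tag path of every start tag, in document order."""

    VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))
        if tag in self.VOID:
            self.stack.pop()

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass


def encode(paths):
    """Map each distinct tag path to an integer id (assumed encoding)."""
    ids = {}
    return [ids.setdefault(p, len(ids)) for p in paths], ids


def build_suffix_trie(seq):
    """Naive O(n^2) suffix trie; each node stores the start positions
    of the suffixes that pass through it."""
    root = {"children": {}, "starts": []}
    for i in range(len(seq)):
        node = root
        for sym in seq[i:]:
            node = node["children"].setdefault(
                sym, {"children": {}, "starts": []}
            )
            node["starts"].append(i)
    return root


def repeated_substrings(seq, min_len=2, min_count=2):
    """Return {substring: start positions} for substrings that occur
    at least min_count times and have at least min_len symbols."""
    found = {}

    def walk(node, prefix):
        for sym, child in node["children"].items():
            sub = prefix + (sym,)
            if len(child["starts"]) >= min_count:
                if len(sub) >= min_len:
                    found[sub] = list(child["starts"])
                walk(child, sub)

    walk(build_suffix_trie(seq), ())
    return found
```

For a page whose list items share the same structure, the repeated integer substrings correspond to candidate record patterns, and the start positions indicate where each candidate record begins in the tag-path sequence. Occurrences may overlap, which is one reason a refinement step such as the paper's filter rules would be needed afterwards.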