Many automatic methods can extract lists of objects from the Web, but they often fail to handle multi-type pages automatically. This paper introduces a new method for record extraction that uses a suffix tree to find repeated substrings. Our method first converts the distinct tag paths that appear repeatedly in the DOM tree of a Web document into a sequence of integers, and then builds a suffix tree over this sequence. Four refining filter rules are defined. After the refining process, the remaining data region patterns can be used to extract data records. Experiments on real data show that the method is applicable to a variety of web pages and achieves higher accuracy and better robustness than previous methods.
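The encoding and repeated-substring steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names and the sample tag paths are hypothetical, the four refining filter rules are not modeled because the abstract does not specify them, and a sorted list of suffixes is used as a simple stand-in for a real suffix tree.

```python
from typing import Dict, List, Tuple


def encode_tag_paths(tag_paths: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct tag path to an integer, in order of first appearance."""
    codes: Dict[str, int] = {}
    sequence: List[int] = []
    for path in tag_paths:
        if path not in codes:
            codes[path] = len(codes)
        sequence.append(codes[path])
    return sequence, codes


def longest_repeated_run(sequence: List[int]) -> List[int]:
    """Return the longest contiguous run that occurs at least twice.

    Assumption: a naive sorted-suffix comparison replaces the suffix tree.
    It finds the same repeated substrings but is slower on long inputs.
    Occurrences are allowed to overlap.
    """
    # Sort suffix start positions by the suffix they point to.
    suffixes = sorted(range(len(sequence)), key=lambda i: sequence[i:])
    best: List[int] = []
    # Suffixes sharing a long prefix end up adjacent after sorting.
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while (a + k < len(sequence) and b + k < len(sequence)
               and sequence[a + k] == sequence[b + k]):
            k += 1
        if k > len(best):
            best = sequence[a:a + k]
    return best


# Hypothetical tag paths read from a page in document order.
paths = [
    "body/h1",
    "body/div/a", "body/div/span",
    "body/hr",
    "body/div/a", "body/div/span",
    "body/p",
]
sequence, codes = encode_tag_paths(paths)
print(sequence)                        # [0, 1, 2, 3, 1, 2, 4]
print(longest_repeated_run(sequence))  # [1, 2]
```

Here the repeated run `[1, 2]` corresponds to the tag paths `body/div/a` followed by `body/div/span`, which is the kind of candidate data region pattern that later filtering would accept or reject.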