A hierarchical approach to wrapper induction
Proceedings of the third annual conference on Autonomous Agents
Introduction to Modern Information Retrieval
Introduction to Modern Information Retrieval
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Mining data records in Web pages
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Message Understanding Conference-6: a brief history
COLING '96 Proceedings of the 16th conference on Computational linguistics - Volume 1
Detecting data records in semi-structured web sites based on text token clustering
Integrated Computer-Aided Engineering
Hi-index | 0.01 |
This paper introduces an approach that achieves automated data extraction from semi-structured Web pages by clustering. Both HTML tags and the textual features of text tokens are considered for similarity comparison. The first clustering process groups similar text tokens into the same text clusters, and the second clustering process groups similar data tuples into tuple clusters. A tuple cluster is a strong candidate of a repetitive data region.