Using clustering for web information extraction

Authors:
Le Phong Bao Vuong;Xiaoying Gao
Affiliations:
School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand;School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand
Venue:
AI'07 Proceedings of the 20th Australian joint conference on Advances in artificial intelligence
Year:
2007

Abstract

This paper introduces an approach that automates data extraction from semi-structured Web pages by clustering. Similarity between text tokens is computed from both their HTML tags and their textual features. The first clustering process groups similar text tokens into text clusters, and the second clustering process groups similar data tuples into tuple clusters. A tuple cluster is a strong candidate for a repetitive data region.
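The two clustering passes described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the clustering algorithm, or how tuples are formed. The `Token` type, the three textual features, the equal weighting of tag and text scores, the greedy threshold clustering, the thresholds, and the assumption that tokens arrive already grouped by their enclosing element are all hypothetical choices made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    tag_path: str  # hypothetical representation of the enclosing HTML tags


def text_features(text):
    # Assumed textual features: capitalised, contains a digit, starts with "$".
    return (text[:1].isupper(), any(c.isdigit() for c in text), text.startswith("$"))


def token_similarity(a, b):
    # Assumed measure: equal-weight mix of tag-path match and feature agreement.
    tag_score = 1.0 if a.tag_path == b.tag_path else 0.0
    fa, fb = text_features(a.text), text_features(b.text)
    feat_score = sum(x == y for x, y in zip(fa, fb)) / len(fa)
    return 0.5 * tag_score + 0.5 * feat_score


def tuple_similarity(a, b):
    # Assumed measure: fraction of aligned positions with the same text cluster.
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(items, similarity, threshold):
    # Put each item in the first cluster whose representative is similar
    # enough; otherwise start a new cluster. Returns one label per item.
    reps, labels = [], []
    for item in items:
        for idx, rep in enumerate(reps):
            if similarity(item, rep) >= threshold:
                labels.append(idx)
                break
        else:
            reps.append(item)
            labels.append(len(reps) - 1)
    return labels


def extract_candidates(groups, token_threshold=0.75, tuple_threshold=0.8):
    # Pass 1: cluster all text tokens on the page.
    flat = [tok for group in groups for tok in group]
    token_labels = iter(greedy_cluster(flat, token_similarity, token_threshold))
    # Represent each group as a tuple of text-cluster labels.
    tuples = [tuple(next(token_labels) for _ in group) for group in groups]
    # Pass 2: cluster the tuples.
    tuple_labels = greedy_cluster(tuples, tuple_similarity, tuple_threshold)
    clusters = {}
    for index, label in enumerate(tuple_labels):
        clusters.setdefault(label, []).append(index)
    # Tuple clusters with repeated members are candidate repetitive regions.
    return {label: idxs for label, idxs in clusters.items() if len(idxs) > 1}


page = [
    [Token("Dune", "ul/li/b"), Token("$9.99", "ul/li/span")],
    [Token("Emma", "ul/li/b"), Token("$7.50", "ul/li/span")],
    [Token("Contact us", "div/p")],
]
candidates = extract_candidates(page)
print(candidates)  # {0: [0, 1]}
```

In this toy page the two list items share the same sequence of text clusters and so fall into one tuple cluster, while the footer line stays on its own and is not reported.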