ClusTex: Information Extraction from HTML Pages

Authors:
Fatima Ashraf;Reda Alhajj
Affiliations:
University of Calgary, Canada;University of Calgary, Canada/ Global University, Lebanon
Venue:
AINAW '07 Proceedings of the 21st International Conference on Advanced Information Networking and Applications Workshops - Volume 01
Year:
2007

Abstract

This paper proposes ClusTex, a system that employs clustering techniques for automatic information extraction from HTML documents containing semi-structured data. Using domain-specific information provided by the user, ClusTex parses and tokenizes the data from an HTML document, partitions it into clusters containing similar elements, and estimates an extraction rule based on the pattern of occurrence of data tokens. The extraction rule is then used to refine the clusters, and finally the output is reported. To demonstrate its effectiveness, the approach is tested in experiments on the University of Calgary website; the results prove comparable to those reported in the literature.
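
The pipeline described in the abstract (tokenize, cluster, estimate a rule, refine) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which clustering technique or rule format ClusTex uses, so the sketch substitutes a coarse "token shape" grouping for clustering and treats the rule as the most frequent run of consecutive shapes. All names (TextTokenizer, token_shape, cluster_tokens, estimate_rule, extract_records), the sample HTML, and the use of a record length as the user-supplied domain information are assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextTokenizer(HTMLParser):
    """Collect the non-empty text nodes of an HTML document as tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(text)


def token_shape(token):
    """Map a token to a coarse shape (assumed stand-in for a similarity measure)."""
    if re.fullmatch(r"[\w.+-]+@[\w.-]+", token):
        return "EMAIL"
    if re.fullmatch(r"[\d\s()+-]+", token):
        return "NUMBER"
    return "TEXT"


def cluster_tokens(tokens):
    """Group tokens that share a shape; a simplification of real clustering."""
    clusters = {}
    for tok in tokens:
        clusters.setdefault(token_shape(tok), []).append(tok)
    return clusters


def estimate_rule(tokens, length):
    """Return the most frequent run of `length` consecutive token shapes."""
    shapes = [token_shape(t) for t in tokens]
    windows = Counter(
        tuple(shapes[i:i + length]) for i in range(len(shapes) - length + 1)
    )
    return windows.most_common(1)[0][0]


def extract_records(tokens, rule):
    """Refinement step: keep only token runs whose shapes match the rule."""
    shapes = [token_shape(t) for t in tokens]
    records, i = [], 0
    while i + len(rule) <= len(tokens):
        if tuple(shapes[i:i + len(rule)]) == rule:
            records.append(tokens[i:i + len(rule)])
            i += len(rule)
        else:
            i += 1
    return records


# Hypothetical input: a small staff table with one stray heading.
SAMPLE_HTML = """
<h1>Staff directory</h1>
<table>
<tr><td>Ada Lovelace</td><td>ada@example.org</td><td>403 555 0101</td></tr>
<tr><td>Alan Turing</td><td>alan@example.org</td><td>403 555 0102</td></tr>
</table>
"""

parser = TextTokenizer()
parser.feed(SAMPLE_HTML)
parser.close()

clusters = cluster_tokens(parser.tokens)
rule = estimate_rule(parser.tokens, length=3)  # record length supplied by the user
records = extract_records(parser.tokens, rule)
```

In this sketch the heading "Staff directory" lands in the TEXT cluster during the first pass but is dropped in the refinement step, because it never starts a run of shapes that matches the estimated rule.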