Automatic data record detection in web pages

Authors:
Xiaoying Gao;Le Phong Bao Vuong;Mengjie Zhang
Affiliations:
School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand;School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand;School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, Wellington, New Zealand
Venue:
KSEM'07 Proceedings of the 2nd international conference on Knowledge science, engineering and management
Year:
2007

Abstract

Wrapper induction is currently the main technology for data extraction from semi-structured web pages. However, wrapper induction requires training web pages, and the information extraction process is complex, involving pattern induction, data extraction and data transformation. This paper introduces a new approach that achieves automatic data extraction by applying clustering to detect similar text tokens, developing a new method for labelling text tokens that captures the hierarchical structure of HTML pages, and developing an algorithm for transforming labelled text tokens into XML. The approach is examined and compared with a number of existing wrapper induction systems on three different sets of web pages. The results suggest that the new approach is effective for data extraction and that it outperforms the existing approaches on these web sites. The approach requires no training and has no explicit processes for pattern induction or data extraction, so the whole process is simplified.
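
The abstract names three stages: grouping similar text tokens, labelling them according to the HTML hierarchy, and transforming the labelled tokens to XML. The following is a minimal illustrative sketch of that general pipeline, not the authors' algorithm. It assumes that two text tokens are "similar" when they share the same path of enclosing tags (a crude stand-in for the clustering step), and that a repeated label marks the start of a new record. The names `TokenCollector`, `tokens_to_xml`, `records`, `record` and `fieldN` are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never have a closing tag; they are ignored when tracking the path.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TokenCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/".join(self.stack), text))


def tokens_to_xml(html):
    """Group text tokens by tag path and emit one XML element per record."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()

    # Assumption: tokens with an identical tag path belong to the same group.
    labels = {}
    for path, _ in collector.tokens:
        labels.setdefault(path, f"field{len(labels) + 1}")

    root = ET.Element("records")
    record, seen = None, set()
    for path, text in collector.tokens:
        label = labels[path]
        # Assumption: seeing a label again means the next record has started.
        if record is None or label in seen:
            record = ET.SubElement(root, "record")
            seen = set()
        seen.add(label)
        ET.SubElement(record, label).text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = ("<ul><li><b>Book A</b><i>$10</i></li>"
            "<li><b>Book B</b><i>$12</i></li></ul>")
    print(tokens_to_xml(page))
```

Exact tag-path matching is far stricter than a real clustering step would be, and it breaks down on pages with optional or missing fields. The sketch is intended only to show how token grouping, labelling and XML generation can fit together without a training phase.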