Detecting data records in semi-structured web sites based on text token clustering
Integrated Computer-Aided Engineering
Wrapper induction is currently the main technology for extracting data from semi-structured web pages. However, it requires training web pages, and its extraction process is complex, involving pattern induction, data extraction and data transformation. This paper introduces a new approach that achieves automatic data extraction in three steps: clustering to detect similar text tokens, a new method for labelling text tokens that captures the hierarchical structure of HTML pages, and an algorithm for transforming the labelled text tokens into XML. The approach is evaluated against a number of existing wrapper induction systems on three different sets of web pages. The results suggest that the new approach is effective for data extraction and that it outperforms the existing approaches on these web sites. Because the approach requires no training and has no explicit pattern induction or data extraction steps, the whole process is simplified.
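To make the three steps concrete, the following is a minimal Python sketch and not the paper's algorithm. It assumes, purely for illustration, that two text tokens are "similar" when they share the same HTML tag path, which stands in for the clustering step; the abstract does not state the actual similarity measure or labelling scheme. The names TokenCollector, cluster_tokens, to_xml, extract and the generic field1, field2 labels are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from xml.etree import ElementTree as ET

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TokenCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/".join(self.stack), text))


def cluster_tokens(tokens):
    """Assumed stand-in for clustering: group tokens by identical tag path."""
    clusters = defaultdict(list)
    for path, text in tokens:
        clusters[path].append(text)
    return clusters


def to_xml(tokens, min_size=2):
    """Label tokens from repeating clusters and emit them as XML records."""
    labels = {}
    for path, texts in cluster_tokens(tokens).items():
        if len(texts) >= min_size:  # keep only paths that repeat
            labels[path] = f"field{len(labels) + 1}"

    root = ET.Element("records")
    record = None
    for path, text in tokens:
        label = labels.get(path)
        if label is None:
            continue  # token belongs to no repeating cluster
        # The first repeating field is assumed to open a new record.
        if label == "field1" or record is None:
            record = ET.SubElement(root, "record")
        ET.SubElement(record, label).text = text
    return ET.tostring(root, encoding="unicode")


def extract(html):
    parser = TokenCollector()
    parser.feed(html)
    parser.close()
    return to_xml(parser.tokens)


HTML = """
<html><body><h1>Books</h1>
<ul>
<li><b>Dune</b><i>1965</i></li>
<li><b>Emma</b><i>1815</i></li>
</ul></body></html>
"""

if __name__ == "__main__":
    print(extract(HTML))
```

In this sketch the heading text is dropped because its tag path occurs only once, while the two list items become two records. Grouping by exact tag path is a deliberately crude choice: it needs no training pages, which mirrors the motivation above, but a real system would need a more tolerant similarity measure to cope with irregular markup.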