Identifying syntactic differences between two programs
Software—Practice & Experience
A brief survey of web data extraction tools
ACM SIGMOD Record
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Automatic web news extraction using tree edit distance
Proceedings of the 13th international conference on World Wide Web
Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web
A Survey of Web Information Extraction Systems
IEEE Transactions on Knowledge and Data Engineering
Unsupervised Learning of Tree Alignment Models for Information Extraction
ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops
Scalable and noise tolerant web knowledge extraction for search task simplification
Decision Support Systems
Hi-index | 0.00 |
This paper investigates the automatic extraction of data from forums, blogs and news web sites. Web pages are increasingly dynamically generated using a common template populated with data from databases. This paper proposes a novel method that uses tree alignment to automatically extract data from these types of web pages. A new tree alignment algorithm is presented for determining the optimal matching structure of the input web pages. Based on the alignment, the trees are merged into one union tree whose nodes record statistical information obtained from multiple web pages. A heuristic method is employed for determining the most probable content block and the alignment algorithm detects repeating patterns on the union tree. A wrapper built on the most probable content block and the repeating patterns extracts data from web pages. Experimental results show that the method achieves high extraction accuracy and has steady performance.