Automatic web data extraction using tree alignment

Authors:
Yingju Xia;Hao Yu;Shu Zhang
Affiliations:
Fujitsu Research & Development Center Co., LTD., Beijing, China;Fujitsu Research & Development Center Co., LTD., Beijing, China;Fujitsu Research & Development Center Co., LTD., Beijing, China
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Citing (8):
Identifying syntactic differences between two programs. Software—Practice & Experience
A brief survey of web data extraction tools. ACM SIGMOD Record
RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Proceedings of the 27th International Conference on Very Large Data Bases
Extracting structured data from Web pages. Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Automatic web news extraction using tree edit distance. Proceedings of the 13th international conference on World Wide Web
Web data extraction based on partial tree alignment. WWW '05 Proceedings of the 14th international conference on World Wide Web
A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering
Unsupervised Learning of Tree Alignment Models for Information Extraction. ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops

Cited by (1):
Scalable and noise tolerant web knowledge extraction for search task simplification. Decision Support Systems


Abstract

This paper investigates the automatic extraction of data from forum, blog, and news websites. Such pages are increasingly generated dynamically from a common template populated with data from a database. The paper proposes a novel method that uses tree alignment to extract data from these pages automatically. A new tree alignment algorithm determines the optimal matching structure of the input web pages. Based on the alignment, the trees are merged into one union tree whose nodes record statistical information gathered from multiple web pages. A heuristic method determines the most probable content block, and the alignment algorithm detects repeating patterns on the union tree. A wrapper built on the most probable content block and the repeating patterns then extracts data from web pages. Experimental results show that the method achieves high extraction accuracy and steady performance.
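
The abstract does not spell out the alignment algorithm, so the following is only a minimal sketch of the general idea of aligning page trees and merging them into a union tree with per-node statistics. It assumes a plain top-down "simple tree matching" dynamic program as a stand-in for the paper's new algorithm, and it uses a per-node page count as the only recorded statistic. The names `Node`, `tree`, `match_size`, `align_children`, and `merge` are hypothetical and do not come from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list[Node] = field(default_factory=list)
    count: int = 1  # assumption: number of pages in which this node was seen


def tree(tag: str, *children: Node) -> Node:
    """Convenience constructor for small example trees."""
    return Node(tag, list(children))


def match_size(a: Node, b: Node) -> int:
    """Size of the largest top-down, order-preserving mapping between a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match_size(a.children[i - 1], b.children[j - 1])
            table[i][j] = max(table[i - 1][j], table[i][j - 1],
                              table[i - 1][j - 1] + w)
    return table[m][n] + 1  # +1 for the matched roots


def align_children(a: Node, b: Node) -> list[tuple[int, int]]:
    """Return index pairs (i, j) of children matched by the same DP."""
    m, n = len(a.children), len(b.children)
    w = [[match_size(x, y) for y in b.children] for x in a.children]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(table[i - 1][j], table[i][j - 1],
                              table[i - 1][j - 1] + w[i - 1][j - 1])
    # Backtrack through the table to recover the matched pairs.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        diag = table[i - 1][j - 1] + w[i - 1][j - 1]
        if w[i - 1][j - 1] > 0 and table[i][j] == diag:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif table[i][j] == table[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def merge(union: Node, page: Node) -> Node:
    """Merge `page` into `union` in place; the two roots must share a tag."""
    union.count += page.count
    merged, ui, pj = [], 0, 0
    for i, j in align_children(union, page):
        # Unmatched nodes from either side are kept, in document order.
        merged += union.children[ui:i] + page.children[pj:j]
        merged.append(merge(union.children[i], page.children[j]))
        ui, pj = i + 1, j + 1
    merged += union.children[ui:] + page.children[pj:]
    union.children = merged
    return union


page1 = tree("body", tree("h1"), tree("div", tree("p")), tree("ul"))
page2 = tree("body", tree("h1"), tree("div", tree("p"), tree("p")), tree("span"))
union = merge(page1, page2)
print([(c.tag, c.count) for c in union.children])
# prints [('h1', 2), ('div', 2), ('ul', 1), ('span', 1)]
```

In this sketch, nodes whose count equals the number of merged pages are likely template structure, while low-count nodes are page-specific. The content-block heuristic and the repeating-pattern detection mentioned in the abstract are not modeled here.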