Data extraction for search engine using safe matching

Authors:
Jer Lang Hong; Ee Xion Tan; Fariza Fauzi
Affiliations:
School of Computing and IT, Taylor's University, Malaysia; School of IT, Monash University; School of IT, Monash University
Venue:
AI'11 Proceedings of the 24th international conference on Advances in Artificial Intelligence
Year:
2011

Abstract

Our study shows that the algorithms used to check the similarity of data records affect the efficiency of a wrapper. A closer examination indicates that the accuracy of a wrapper can be improved if the DOM tree and the visual properties of data records are fully utilized. In this paper, we develop algorithms that check the similarity of data records based on the distinct tags and visual cues of their tree structure. We also develop a voting algorithm that can detect the similarity of data records in a relevant data region even when that region contains irrelevant information such as search identifiers. Together, these allow the wrapper to distinguish potential data regions more correctly and to eliminate a data region only when necessary. Experimental results show that our wrapper performs better than the state-of-the-art wrapper WISH and that it is highly effective in data extraction. This wrapper will be useful for meta search engine applications, which need an accurate tool to locate their sources of information.
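
The abstract does not give the similarity measure or the voting rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each data record is available as a well-formed XML fragment, that similarity is the Jaccard overlap of the records' distinct tag names, and that a region is kept when a strict majority of its records resemble at least half of their siblings. The function names and the 0.6 threshold are hypothetical, and visual cues are left out entirely.

```python
from __future__ import annotations

from xml.etree import ElementTree as ET


def distinct_tags(record_xml: str) -> set[str]:
    """Return the set of distinct tag names in a record's subtree."""
    root = ET.fromstring(record_xml)
    return {node.tag for node in root.iter()}


def tag_similarity(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two tag sets (an assumed measure, not the paper's)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def region_is_relevant(records: list[str], threshold: float = 0.6) -> bool:
    """Assumed voting rule: keep the region if most records look alike.

    A record casts a vote when it is similar to at least half of the other
    records. The region is kept when a strict majority of records vote, so
    a single odd record (for example a search identifier line) does not
    cause the whole region to be eliminated.
    """
    tag_sets = [distinct_tags(r) for r in records]
    if len(tag_sets) < 2:
        return False
    votes = 0
    for i, current in enumerate(tag_sets):
        others = [t for j, t in enumerate(tag_sets) if j != i]
        similar = sum(tag_similarity(current, o) >= threshold for o in others)
        if similar * 2 >= len(others):
            votes += 1
    return votes * 2 > len(tag_sets)


# Three result-like records plus one unrelated "search identifier" line.
result = "<div><a>title</a><span>snippet</span></div>"
identifier = "<p><b>Results 1-10</b></p>"
print(region_is_relevant([result, result, result, identifier]))
```

Under these assumptions the odd record is outvoted and the region survives, which mirrors the abstract's goal of eliminating a data region only when necessary. Real result pages are rarely well-formed XML, so a practical version would need an HTML parser in place of `xml.etree`.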