Web data extraction using visual features

Authors:
V. Padmadas;J. Gadge
Affiliations:
Thadomal Shahani Engg College, Mumbai;Thadomal Shahani Engg College, Mumbai
Venue:
Proceedings of the International Conference and Workshop on Emerging Trends in Technology
Year:
2010

Abstract

Automatic data extraction from Web pages is a challenging yet significant problem in the fields of Information Retrieval and Data Mining. The problem arises particularly on the World Wide Web, because search engines wrap the results of user queries in web response pages. These response pages are often decorated with side bars, branding banners and advertisements. Automatic data extraction therefore has to separate the relevant data from the rest of the page. Though many automated and manual text analysis solutions to this problem exist, most of them depend heavily on the specifics of HTML and have to be changed whenever the markup language changes. This paper proposes a novel, language-independent technique that solves the data extraction problem with a combined approach that makes use of features of the DOM tree as well as the visual features of HTML elements.