Automatic data extraction from Web pages is a challenging yet significant problem in the fields of Information Retrieval and Data Mining. The problem arises particularly on the World Wide Web because search engines wrap the results of user queries in response pages, which are often decorated with sidebars, branding banners, and advertisements. Automatic data extraction therefore has to separate the relevant data from this surrounding content. Though many automated and manual text analysis solutions to this problem exist, most of them depend heavily on the specifics of HTML and have to be revised whenever the markup language changes. This paper proposes a novel, language-independent technique that solves the data extraction problem with a combined approach, making use of features of the DOM tree as well as the visual features of HTML elements.