Advanced information retrieval from web pages

Authors:
A. Vedeshin
Affiliations:
Tallinn University of Technology, Tallinn, Estonia
Venue:
FDIA'07 Proceedings of the 1st BCS IRSG conference on Future Directions in Information Access
Year:
2007

Abstract

This work proposes a lightweight, web-based algorithm that runs at near real-time speed. It retrieves the main parts (menu, main text, header, and footer) of a randomly selected web page, making full use of CSS, JavaScript, frames, layers, images, etc. for retrieval. The shortcomings of well-known modern algorithms for content retrieval from web pages are also discussed. The algorithm is useful for improving existing engines for searching, content matching, summarization, web graph calculation, and similar tasks. It is also practical as a data provider for classification and data mining. Experimental results with a PHP implementation of the algorithm showed near real-time speed, a 20-25% error rate in the multipurpose mode, and an error rate below 1% in the specific mode.
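The abstract does not describe the algorithm's internals, so the following Python sketch is only an illustration of what labelling the main parts of a page can look like. It is not the author's method. It assumes well-formed HTML without scripts or styles, treats each direct child of `<body>` as one block, and uses a simple link-density heuristic in place of the CSS, JavaScript, frame, layer, and image cues mentioned above. The names `BlockCollector`, `label_blocks`, and the `menu_link_ratio` threshold are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are ignored for depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BlockCollector(HTMLParser):
    """Collect text and link-text lengths for each direct child of <body>."""

    def __init__(self):
        super().__init__()
        self.depth = None    # nesting depth inside <body>; None while outside it
        self.link_depth = 0  # > 0 while inside an <a> element
        self.blocks = []     # one dict per direct child of <body>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "body":
            self.depth = 0
            return
        if self.depth is None:
            return
        if self.depth == 0:
            # A new direct child of <body> starts a new block.
            self.blocks.append({"text": 0, "link": 0})
        self.depth += 1
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth is None:
            return
        if tag == "body":
            self.depth = None
            return
        if self.depth > 0:
            self.depth -= 1
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        if not self.depth:  # outside <body> or between blocks
            return
        n = len(data.strip())
        self.blocks[-1]["text"] += n
        if self.link_depth:
            self.blocks[-1]["link"] += n


def label_blocks(html, menu_link_ratio=0.6):
    """Return one label per block (assumed heuristic, not the paper's rules)."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    blocks = parser.blocks
    labels = ["other"] * len(blocks)
    if not blocks:
        return labels
    # Blocks whose text is mostly link text are treated as menus.
    for i, b in enumerate(blocks):
        if b["text"] and b["link"] / b["text"] >= menu_link_ratio:
            labels[i] = "menu"
    # Among the remaining blocks, the one with the most text is the main text.
    rest = [i for i, lab in enumerate(labels) if lab == "other"]
    if rest:
        labels[max(rest, key=lambda i: blocks[i]["text"])] = "main text"
    # An unlabelled first or last block is treated as header or footer.
    if labels[0] == "other":
        labels[0] = "header"
    if labels[-1] == "other":
        labels[-1] = "footer"
    return labels


PAGE = """<html><body>
<div><h1>Example Site</h1></div>
<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>
<div><p>This is the long main article text of the page, with one
<a href="/x">link</a> inside it.</p></div>
<div>Copyright 2007</div>
</body></html>"""

print(label_blocks(PAGE))
```

A purely positional, link-density rule like this is only a baseline. It breaks down on pages that build their layout with frames, layers, or scripts, which is the kind of page the abstract says the proposed algorithm is designed to handle.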