Recognition of Common Areas in a Web Page Using a Visualization Approach

Authors:
Milos Kovacevic;Michelangelo Diligenti;Marco Gori;Veljko M. Milutinovic
Venue:
AIMSA '02 Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications
Year:
2002

Abstract

Extracting and processing information from web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on that flat representation. In this paper we propose a new, hierarchical representation that includes the browser screen coordinates of every HTML object in a page. Using this spatial information, one can define heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. Initial experiments show that, using our heuristics, the defined objects are recognized properly in 73% of cases.
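
The idea of recognizing page areas from screen coordinates can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not state the actual heuristics, so the `Box` type, the `classify_area` function, and all three threshold fractions below are hypothetical and chosen only for illustration. The sketch assumes each HTML object has already been rendered to a rectangle in browser pixels.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered rectangle of one HTML object, in browser screen pixels."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


# Hypothetical thresholds (fractions of the page size), for illustration only.
HEADER_FRACTION = 0.15   # top band treated as the header region
FOOTER_FRACTION = 0.15   # bottom band treated as the footer region
SIDE_FRACTION = 0.20     # left and right bands treated as menu regions


def classify_area(box: Box, page_width: float, page_height: float) -> str:
    """Assign a box to a page area using only its position on the page."""
    top, bottom = box.y, box.y + box.height
    left, right = box.x, box.x + box.width

    # Vertical bands are checked first, so a full-width banner is a header
    # rather than a menu.
    if bottom <= page_height * HEADER_FRACTION:
        return "header"
    if top >= page_height * (1 - FOOTER_FRACTION):
        return "footer"
    # Remaining boxes are split by horizontal position.
    if right <= page_width * SIDE_FRACTION:
        return "left menu"
    if left >= page_width * (1 - SIDE_FRACTION):
        return "right menu"
    return "center"


if __name__ == "__main__":
    # A narrow column on the left side of a 1000 x 2000 pixel page.
    print(classify_area(Box(0, 400, 150, 800), 1000, 2000))
```

A rule set of this kind needs no text analysis at all, which is the contrast with the "bag of words" representation; in practice the thresholds would have to be tuned on real pages.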