In this paper, we address the automatic identification of common functional structures on web pages, a fundamental problem for web automation applications and graphical user interface testing. In contrast to other approaches, we aim to identify relevant patterns without relying on a page's source code or on keywords, using mostly geometrical and visually perceptible properties. We achieve this by transforming pages into an independent geometrical representation, from which we extract a set of features that allows us to employ traditional machine learning techniques for the identification task. We evaluate this approach on three typical scenarios, reviewing key information retrieval metrics and estimating the relevance of the chosen features. Our initial results demonstrate the feasibility of the proposed approach.
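
To make the pipeline described above more concrete, the following is a minimal sketch of what a geometrical representation, a feature extraction step, and a metric computation could look like. The abstract does not specify the actual representation, feature set, or learner, so everything here is an assumption for illustration: the `Box` type, the particular features in `geometric_features`, and the `precision_recall` helper are hypothetical names, not the method of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box of a rendered page element, in pixels."""
    x: float
    y: float
    width: float
    height: float


def geometric_features(box: Box, page_width: float, page_height: float) -> dict:
    """Derive source-code-independent features from geometry alone.

    The feature set is an assumption; the abstract does not list the
    features actually used.
    """
    return {
        "rel_x": box.x / page_width,
        "rel_y": box.y / page_height,
        "rel_width": box.width / page_width,
        "rel_height": box.height / page_height,
        "aspect_ratio": box.width / box.height if box.height else 0.0,
        "rel_area": (box.width * box.height) / (page_width * page_height),
    }


def precision_recall(predicted: list, actual: list) -> tuple:
    """Compute precision and recall for boolean predictions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Example: a full-width banner at the top of a 1000 x 2000 pixel page.
banner = Box(x=0, y=0, width=1000, height=100)
features = geometric_features(banner, page_width=1000, page_height=2000)
```

Feature dictionaries of this kind could then be passed to any conventional classifier, and `precision_recall` applied to its predictions on held-out, labeled elements.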