A tool for link-based web page classification

Authors:
Inma Hernández;Carlos R. Rivero;David Ruiz;Rafael Corchuelo
Affiliations:
University of Seville, Seville, Spain;University of Seville, Seville, Spain;University of Seville, Seville, Spain;University of Seville, Seville, Spain
Venue:
CAEPIA'11: Proceedings of the 14th International Conference on Advances in Artificial Intelligence: Spanish Association for Artificial Intelligence
Year:
2011

Abstract

Virtual integration systems require a crawler that navigates web sites automatically in search of relevant information. This process runs online: while the system looks for the required information, the user is waiting for a response. To keep the crawler efficient, it is therefore essential to download as few irrelevant pages as possible. Most crawlers must download a page before they can determine its relevance, so they end up downloading many irrelevant pages. In this paper, we propose a classifier that helps crawlers navigate web sites efficiently. The classifier determines whether a web page is relevant by analysing only its URL. This minimises the number of irrelevant pages downloaded, improves crawling efficiency and reduces bandwidth use, which makes the classifier suitable for virtual integration systems.
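
The abstract does not say how the classifier works internally, so the following is only an illustrative sketch of the general idea of deciding relevance from a URL before downloading the page. It is not the authors' method. It assumes a multinomial naive Bayes model over URL tokens, and the names `url_tokens` and `UrlClassifier`, the labels, and the example.com training URLs are all hypothetical.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def url_tokens(url):
    """Split a URL into coarse tokens: path segments and query parameter names."""
    parts = urlsplit(url)
    tokens = []
    for segment in parts.path.split("/"):
        if segment:
            # Collapse digit runs so /product/123 and /product/456 share a token.
            tokens.append(re.sub(r"\d+", "<num>", segment.lower()))
    for key, _value in parse_qsl(parts.query, keep_blank_values=True):
        tokens.append("?" + key.lower())
    return tokens


class UrlClassifier:
    """Assumed model: multinomial naive Bayes over URL tokens, Laplace smoothing."""

    def __init__(self):
        self.token_counts = {}       # label -> Counter of tokens seen for that label
        self.doc_counts = Counter()  # label -> number of training URLs
        self.vocabulary = set()

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            tokens = url_tokens(url)
            self.token_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, url):
        tokens = url_tokens(url)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, counts in self.token_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            label_total = sum(counts.values())
            for token in tokens:
                score += math.log((counts[token] + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled URLs from an imaginary site.
training = [
    ("http://example.com/product/123?ref=home", "relevant"),
    ("http://example.com/product/456", "relevant"),
    ("http://example.com/product/789?ref=search", "relevant"),
    ("http://example.com/about", "irrelevant"),
    ("http://example.com/help/contact", "irrelevant"),
    ("http://example.com/login?next=/", "irrelevant"),
]
clf = UrlClassifier().fit([u for u, _ in training], [l for _, l in training])

# A crawler could consult the classifier before fetching each candidate link.
candidates = ["http://example.com/product/999", "http://example.com/help/faq"]
to_download = [u for u in candidates if clf.predict(u) == "relevant"]
```

In this sketch only the first candidate ends up in `to_download`, so the second page is never requested. A real system would need far more training URLs and an evaluation of misclassification cost, since skipping a relevant page loses information while fetching an irrelevant one only wastes bandwidth.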