Purely URL-based topic classification

Authors:
Eda Baykan;Monika Henzinger;Ludmila Marian;Ingmar Weber
Affiliations:
Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland;Ecole Polytechnique Fédérale de Lausanne & Google Zürich, Lausanne, Switzerland;Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland;Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Venue:
Proceedings of the 18th international conference on World wide web
Year:
2009

Abstract

Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content, but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to downloading it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
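To make the setup concrete, the following is a minimal sketch of a single per-topic binary classifier that sees only the URL string, together with the F-measure used to score it. It is not the authors' method: the abstract does not specify features or learning algorithm, so the tokenization rule, the stop-token list, the naive Bayes model with add-one smoothing, and the example URLs are all assumptions chosen for illustration. The F-measure here is returned on a 0 to 1 scale, whereas the abstract reports it on a 0 to 100 scale.

```python
import math
import re
from collections import Counter

# Assumed stop tokens: URL parts that carry little topical signal.
STOP_TOKENS = {"http", "https", "www", "com", "org", "html", "htm"}


def url_tokens(url):
    """Lowercase the URL, split on non-letters, drop short and stop tokens."""
    parts = re.split(r"[^a-z]+", url.lower())
    return [p for p in parts if len(p) > 1 and p not in STOP_TOKENS]


class BinaryUrlClassifier:
    """Hypothetical naive Bayes over URL tokens: 'on topic' vs. 'off topic'."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            self.docs[label] += 1
            self.counts[label].update(url_tokens(url))
        return self

    def _log_score(self, tokens, label):
        vocab = set(self.counts[True]) | set(self.counts[False])
        total = sum(self.counts[label].values())
        n_docs = self.docs[True] + self.docs[False]
        # Smoothed class prior plus add-one smoothed token likelihoods.
        score = math.log((self.docs[label] + 1) / (n_docs + 2))
        for token in tokens:
            score += math.log(
                (self.counts[label][token] + 1) / (total + len(vocab) + 1)
            )
        return score

    def predict(self, url):
        tokens = url_tokens(url)
        return self._log_score(tokens, True) > self._log_score(tokens, False)


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall for boolean label lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up training URLs for a hypothetical "Sports" topic.
train_urls = [
    "http://www.example.com/sports/football/scores.html",
    "http://football-news.example.org/league/table",
    "http://www.example.com/recipes/pasta.html",
    "http://cooking.example.org/recipes/soup",
]
train_labels = [True, True, False, False]

sports = BinaryUrlClassifier().fit(train_urls, train_labels)
print(sports.predict("http://example.net/football/results"))
print(sports.predict("http://example.net/recipes/cake"))
```

In a one-classifier-per-topic arrangement like the one the abstract describes, a separate instance of such a model would be trained for each topic, with that topic's URLs as positives and all other URLs as negatives.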