A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification

Authors: Eda Baykan; Monika Henzinger; Ludmila Marian; Ingmar Weber
Affiliations: Izmir University; University of Vienna; CERN; Yahoo! Research
Venue: ACM Transactions on the Web (TWEB)
Year: 2011



Abstract

Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several feature-generation techniques and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed above 90 while maintaining a typical level of recall between 30 and 40.
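
To make the setting concrete, the following is a minimal sketch of how features might be generated from a URL alone, together with a helper that relates precision and recall to a single F-measure. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the feature sets, so the token and character n-gram features, the choice of n = 4, the function names, and the example.com URL are all hypothetical. The F-measure helper assumes the balanced form (harmonic mean of precision and recall), which the abstract does not confirm.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL's host and path into lowercase alphabetic tokens."""
    # Assumption: digits and punctuation act as token separators.
    parts = urlsplit(url if "://" in url else "http://" + url)
    text = (parts.netloc + parts.path).lower()
    return [t for t in re.split(r"[^a-z]+", text) if t]


def char_ngrams(token, n):
    """Return all contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def url_features(url, n_values=(4,)):
    """Bag of features: whole tokens plus their character n-grams."""
    features = []
    for token in url_tokens(url):
        features.append(token)
        for n in n_values:
            features.extend(char_ngrams(token, n))
    return features


def f_measure(precision, recall):
    """Balanced F-measure, assumed here to be the harmonic mean."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical URL, used only for illustration.
    print(url_features("http://www.example.com/sports/football-news.html"))
    # High-precision, low-recall operating point on a 0-100 scale.
    print(f_measure(90, 35))
```

Under the balanced-F assumption, a precision of 90 with a recall of 35 gives an F-measure of about 50.4, which illustrates the cost in recall of pushing precision above 90.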