Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several feature generation techniques and classification algorithms. The individual binary classifiers were then combined via boosting into meta-binary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. Precision can be pushed above 90 while maintaining a typical level of recall between 30 and 40.
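The abstract does not say which URL features were used, so the following is only an illustrative sketch of one common way to turn a URL into sparse features: split it into alphabetic tokens, then add character n-grams of each token. The function names, the stop list, and the n-gram lengths are all assumptions made for this example, not the authors' setup.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical stop list of tokens that carry little topical signal.
STOP = {"www", "http", "https", "com", "html", "htm", "index"}


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens, dropping stop tokens."""
    # Add a scheme if missing so urlsplit puts the host in netloc.
    parts = urlsplit(url if "://" in url else "http://" + url)
    text = (parts.netloc + parts.path + "?" + parts.query).lower()
    return [t for t in re.split(r"[^a-z]+", text)
            if len(t) >= 2 and t not in STOP]


def char_ngrams(token, n_min=3, n_max=4):
    """Return all character n-grams of a token for n in [n_min, n_max]."""
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]


def url_features(url):
    """Build a sparse bag of token and n-gram features for one URL."""
    feats = Counter()
    for tok in url_tokens(url):
        feats["tok=" + tok] += 1
        for gram in char_ngrams(tok):
            feats["ng=" + gram] += 1
    return feats


if __name__ == "__main__":
    example = "http://www.example.com/sports/football-news.html"
    print(url_tokens(example))
    print(sorted(url_features(example).items())[:5])
```

A feature dictionary like this could be fed to one binary classifier per topic. The n-grams are there because URLs often run words together or abbreviate them, which makes whole-token features sparse.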