Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Web pages are usually classified using their content, but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to downloading it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course, and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
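As a rough illustration of the setup described above, the following sketch trains one binary classifier for a single topic using only URL strings and scores predictions with the F-measure. The abstract does not say which features or learning algorithm the paper uses, so everything specific here is an assumption made for illustration: the feature set (URL tokens plus character trigrams), the choice of Naive Bayes with add-one smoothing, and the names `url_tokens`, `url_features`, `BinaryNaiveBayes`, and `f_measure`.

```python
import math
import re
from collections import Counter


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens, dropping the scheme and 'www'."""
    url = re.sub(r"^[a-z]+://", "", url.lower())
    return [t for t in re.split(r"[^a-z]+", url) if t and t != "www"]


def url_features(url, n=3):
    """Return tokens plus the character n-grams of each token (assumed feature set)."""
    feats = []
    for tok in url_tokens(url):
        feats.append(tok)
        feats.extend(tok[i:i + n] for i in range(len(tok) - n + 1))
    return feats


class BinaryNaiveBayes:
    """One topic-vs-rest classifier over URL features (illustrative, not the paper's model)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, urls, labels):
        # Assumes at least one positive and one negative training URL.
        for url, label in zip(urls, labels):
            self.counts[label].update(url_features(url))
            self.docs[label] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, feats, label):
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for f in feats:
            # Add-one (Laplace) smoothing over the training vocabulary.
            score += math.log((self.counts[label][f] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, url):
        # Features never seen in training are ignored.
        feats = [f for f in url_features(url) if f in self.vocab]
        return self._log_score(feats, True) > self._log_score(feats, False)


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall for boolean predictions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

To cover several topics in this style, one such classifier would be fitted per topic, each with its own positive and negative URLs, and `f_measure` would be computed per topic on held-out URLs. Note that `f_measure` returns a value between 0 and 1, whereas the figures quoted above are on a 0 to 100 scale.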