Knowing a web page by the company it keeps

Authors:
Xiaoguang Qi;Brian D. Davison
Affiliations:
Lehigh University, Bethlehem, PA;Lehigh University, Bethlehem, PA
Venue:
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Year:
2006

Abstract

Web page classification is important to many tasks in information retrieval and web mining. However, applying traditional textual classifiers to web data often produces unsatisfactory results. Fortunately, hyperlink information provides important clues to the categorization of a web page. In this paper, an improved method is proposed to enhance web page classification by utilizing the class information from neighboring pages in the link graph. The categories represented by four kinds of neighbors (parents, children, siblings and spouses) are combined to help classify the page in question. In experiments to study the effect of these factors on our algorithm, we find that the proposed method is able to boost the classification accuracy of common textual classifiers from around 70% to more than 90% on a large dataset of pages from the Open Directory Project, and that it outperforms existing algorithms. Unlike prior techniques, our approach utilizes same-host links and can improve classification accuracy even when neighboring pages are unlabeled. Finally, while all neighbor types can contribute, sibling pages are found to be the most important.
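
To make the neighbor idea concrete, the following Python sketch shows one way the four neighbor types could be derived from a link graph and blended with a page's own textual prediction. It is an illustration only, not the authors' algorithm: the function names (`neighbors`, `combine`, `classify`), the uniform averaging within each neighbor type, and the `own_weight` and `type_weights` parameters are all assumptions, since the abstract does not specify how the categories are combined. Each page is assumed to already have a class distribution from some textual classifier, which lets unlabeled neighbors contribute as well.

```python
from collections import defaultdict


def neighbors(links, page):
    """Return the four neighbor sets of `page` in a directed link graph.

    `links` maps each page to the set of pages it links to.
    """
    inlinks = defaultdict(set)
    for src, dsts in links.items():
        for dst in dsts:
            inlinks[dst].add(src)

    parents = set(inlinks[page]) - {page}           # pages linking to `page`
    children = set(links.get(page, ())) - {page}    # pages `page` links to
    # Siblings share at least one parent; spouses share at least one child.
    siblings = {c for p in parents for c in links.get(p, ())} - {page}
    spouses = {p for c in children for p in inlinks[c]} - {page}
    return {"parents": parents, "children": children,
            "siblings": siblings, "spouses": spouses}


def average(dists, classes):
    """Uniformly average a list of class distributions (None if empty)."""
    if not dists:
        return None
    return {c: sum(d.get(c, 0.0) for d in dists) / len(dists) for c in classes}


def combine(own, neighbor_dists, classes, own_weight=0.5, type_weights=None):
    """Blend a page's own distribution with per-type neighbor averages.

    `own_weight` and `type_weights` are hypothetical tuning parameters.
    """
    type_weights = type_weights or {t: 1.0 for t in neighbor_dists}
    averaged = {t: average(ds, classes) for t, ds in neighbor_dists.items()}
    averaged = {t: d for t, d in averaged.items() if d is not None}
    total = sum(type_weights[t] for t in averaged)

    scores = {}
    for c in classes:
        if total:
            nb = sum(type_weights[t] * averaged[t][c] for t in averaged) / total
        else:
            nb = own.get(c, 0.0)  # no usable neighbors: fall back to own score
        scores[c] = own_weight * own.get(c, 0.0) + (1 - own_weight) * nb
    return scores


def classify(page, links, text_dist, classes, **kwargs):
    """Pick the top class for `page` using its text and its neighbors."""
    groups = neighbors(links, page)
    neighbor_dists = {
        t: [text_dist[n] for n in ns if n in text_dist]
        for t, ns in groups.items()
    }
    scores = combine(text_dist[page], neighbor_dists, classes, **kwargs)
    return max(classes, key=lambda c: scores[c]), scores


# Toy graph: "hub" links to a, b and x; both x and z link to y.
links = {"hub": {"a", "b", "x"}, "x": {"y"}, "z": {"y"}}
text_dist = {
    "x": {"sports": 0.4, "arts": 0.6},
    "a": {"sports": 0.9, "arts": 0.1},
    "b": {"sports": 0.8, "arts": 0.2},
    "hub": {"sports": 0.7, "arts": 0.3},
}
label, scores = classify("x", links, text_dist, ["sports", "arts"])
print(label, scores)
```

In this toy graph the textual prediction alone would place page `x` in "arts", but its parent and siblings lean strongly toward "sports" and change the final label. Raising the hypothetical `type_weights` entry for `"siblings"` would be one way to reflect the abstract's finding that sibling pages matter most.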