The structure of broad topics on the web

Authors:
Soumen Chakrabarti;Mukul M. Joshi;Kunal Punera;David M. Pennock
Affiliations:
IIT Bombay;IIT Bombay;IIT Bombay;NEC Research Institute
Venue:
Proceedings of the 11th international conference on World Wide Web
Year:
2002

Abstract

The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not in terms of page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random walk algorithms to draw samples that follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to a page about another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
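
As a rough illustration of the measurements described in the abstract, the Python sketch below estimates a topic-to-topic link matrix from labeled pages and tracks how quickly a walk that starts in one topic approaches a background topic distribution. The toy pages, edges, topic labels, function names, the uniform background, and the use of L1 distance are all assumptions made for this example. Collapsing the walk to a Markov chain over topics is also a simplification and is not claimed to be the procedure used in the paper.

```python
from collections import Counter

# Hypothetical toy data: page -> topic label (as an automatic classifier might assign).
TOPIC_OF = {"a1": "arts", "a2": "arts", "s1": "sports", "s2": "sports"}
TOPICS = ["arts", "sports"]
# Hypothetical hyperlinks as (source page, target page) pairs.
EDGES = [("a1", "a2"), ("a2", "a1"), ("a1", "s1"),
         ("s1", "s2"), ("s2", "s1"), ("s2", "a1")]


def topic_link_matrix(edges, topic_of, topics):
    """Row-normalized link counts: matrix[s][t] estimates
    P(target page has topic t | source page has topic s)."""
    counts = {s: Counter() for s in topics}
    for src, dst in edges:
        counts[topic_of[src]][topic_of[dst]] += 1
    matrix = {}
    for s in topics:
        total = sum(counts[s].values())
        matrix[s] = {t: (counts[s][t] / total if total else 0.0) for t in topics}
    return matrix


def walk_step(dist, matrix, topics):
    """Advance a topic distribution by one link-following step."""
    return {t: sum(dist[s] * matrix[s][t] for s in topics) for t in topics}


def l1_distance(p, q, topics):
    """Sum of absolute differences between two topic distributions."""
    return sum(abs(p[t] - q[t]) for t in topics)


def mixing_profile(start_topic, matrix, background, topics, max_steps):
    """L1 distance to the background distribution after each step of a
    walk that starts entirely inside start_topic."""
    dist = {t: (1.0 if t == start_topic else 0.0) for t in topics}
    profile = []
    for _ in range(max_steps):
        dist = walk_step(dist, matrix, topics)
        profile.append(l1_distance(dist, background, topics))
    return profile


matrix = topic_link_matrix(EDGES, TOPIC_OF, TOPICS)
# Assumed background: uniform over the two toy topics.
background = {t: 1.0 / len(TOPICS) for t in TOPICS}
profile = mixing_profile("arts", matrix, background, TOPICS, max_steps=5)
print(matrix)
print(profile)
```

In this sketch, the number of steps after which the distance falls below a chosen threshold plays the role of a topic mixing distance. On real data the background distribution would itself be estimated, for instance from a near-uniform sample of pages, rather than assumed.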