Extraction and classification of dense implicit communities in the Web graph

Authors:
Yon Dourisboure;Filippo Geraci;Marco Pellegrini
Affiliations:
LIUPPA - Université de Pau et des Pays de l'Adour, Pau Cedex, France;Istituto di Informatica e Telematica—CNR, Pisa, Italy;Istituto di Informatica e Telematica—CNR, Pisa, Italy
Venue:
ACM Transactions on the Web (TWEB)
Year:
2009

Abstract

The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information, and services, and there is growing interest in tools for understanding collective behavior and emerging phenomena in the WWW. In this article we focus on the problem of searching for and classifying communities in the Web. Loosely speaking, a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense subgraph of the Web graph (where Web pages are nodes and hyperlinks are arcs). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm to Web graphs built from three publicly available large crawls of the Web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the Web graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for communities of thirty nodes or more (even at low density), and it is still about 80% for communities of twenty nodes with more than 50% of the possible arcs present. At the lower extreme, the algorithm catches 35% of dense communities made of ten nodes. We also develop some sufficient conditions for the detection of a community under some local graph models and not-too-restrictive hypotheses. We complete our Community Watch system by clustering the communities found in the Web graph into homogeneous groups by topic and labeling each group with representative keywords.
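
To make the notion of density used in the abstract concrete, here is a minimal sketch in Python. It assumes that density means the fraction of possible directed arcs among a set of nodes that are actually present; the article's own formal definition may differ. The function name `arc_density` and the toy graph are hypothetical and serve only as illustration.

```python
def arc_density(nodes, arcs):
    """Fraction of possible directed arcs among `nodes` that appear in `arcs`.

    Assumption (not taken from the article): density is measured over
    ordered pairs of distinct nodes, so n nodes allow n * (n - 1) arcs.
    """
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    possible = n * (n - 1)
    # Count distinct arcs with both endpoints inside the node set,
    # ignoring self-loops and arcs that leave or enter the set.
    present = sum(
        1 for (u, v) in set(arcs)
        if u in nodes and v in nodes and u != v
    )
    return present / possible


# Toy example: four pages, six internal hyperlinks, one external link.
community = {"a", "b", "c", "d"}
links = [
    ("a", "b"), ("b", "a"), ("a", "c"),
    ("c", "d"), ("d", "a"), ("b", "c"),
    ("x", "a"),  # arc from outside the candidate community, not counted
]
density = arc_density(community, links)
```

In this toy graph, 6 of the 12 possible arcs are present, so the candidate set sits exactly at the 50% level mentioned for twenty-node communities.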