Characterizing a national community web

Authors:
Daniel Gomes; Mário J. Silva
Affiliations:
University of Lisbon, Lisboa, Portugal; University of Lisbon, Lisboa, Portugal
Venue:
ACM Transactions on Internet Technology (TOIT)
Year:
2005

Abstract

This article presents a characterization of the community Web of the people of Portugal. We defined criteria for delimiting this Web based on our past experience of crawling pages related to Portugal and collected over 3.2 million documents from 46,000 sites satisfying those criteria. Our characterization was derived from this crawl. We describe the rules that we established for defining the boundaries of this community Web and the methodology used to gather statistics. Statistics cover the number and domain distribution of sites; the number, type and size distribution of text documents; and the linkage structure of this Web. We also show how crawling constraints and abnormal situations on the Web can influence the statistics.