Web community analysis and its application to language specific crawling

  • Authors: Kulwadee Somboonviwat

  • Affiliations: King Mongkut's Institute of Technology Ladkrabang (KMITL), Ladkrabang, Bangkok, Thailand

  • Venue: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication

  • Year: 2012

Abstract

This paper proposes a novel metric for web community analysis, called language homogeneity. The language homogeneity of a community is the proportion of its web pages that are written in a specific language. This simple measure provides additional insight into the characteristics of web communities. We analyze web communities extracted from large Thai web datasets with respect to three aspects: (1) community size distribution, (2) similarity with a web directory, and (3) Thai language homogeneity. We find that most Thai web communities are linguistically homogeneous: web pages inside the same community tend to be written in the same language. Based on these results, we argue that the linguistic homogeneity of web communities can be used to enhance language-specific crawling. To this end, we point out current limitations of a language-specific crawler and suggest possible ways of exploiting communities' language homogeneity to improve the performance of language-specific crawling.
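
As an illustration of the metric described in the abstract, the following minimal sketch computes language homogeneity as the fraction of pages in a community that carry a target language label. The abstract does not give the paper's exact formula or its language identification method, so the function name `language_homogeneity`, the per-page language codes, and the choice to return 0.0 for an empty community are all assumptions made here for illustration.

```python
from collections import Counter
from typing import Iterable


def language_homogeneity(page_languages: Iterable[str], target: str) -> float:
    """Fraction of pages in a community labeled with the target language.

    page_languages holds one language label per page (hypothetical input;
    how each page's language is detected is outside this sketch).
    """
    langs = list(page_languages)
    if not langs:
        # Assumption: an empty community is given a homogeneity of 0.0.
        return 0.0
    return Counter(langs)[target] / len(langs)


# Hypothetical community of four pages: three Thai, one English.
community = ["th", "th", "th", "en"]
score = language_homogeneity(community, "th")
```

Under this reading, a score close to 1.0 would indicate a community written almost entirely in the target language, which is the property the abstract suggests a language-specific crawler could exploit.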