A comparative study of statistical features of language in blogs-vs-splogs

  • Authors:
  • Soumya Datta; Sudeshna Sarkar

  • Affiliations:
  • Indian Institute of Technology, Kharagpur, India; Indian Institute of Technology, Kharagpur, India

  • Venue:
  • Proceedings of the second workshop on Analytics for noisy unstructured text data
  • Year:
  • 2008

Abstract

Language usage in blogs deviates from that of traditional corpora, largely because of noise from sources such as spelling errors, grammatical irregularities, and the overuse of abbreviations and symbolic characters such as emoticons. Spam blogs, or splogs, are a subset of blogs that are usually written to target a specific audience for marketing promotions and are mostly generated by software that readily imitates the Zipfian distribution of words. It is therefore difficult to separate splogs from non-splogs using only the frequency distribution of unigrams. In this detailed comparative study, we present and highlight several additional statistical features of language that are hard to imitate and serve as good discriminators between splogs and blogs.
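
To illustrate why a unigram frequency distribution alone is a weak signal, the following minimal Python sketch estimates the slope of a Zipfian rank-frequency curve and shows that the estimate is unchanged when the words of a text are shuffled. It is not the authors' method. The function name `zipf_slope`, the whitespace tokenization, and the synthetic sample text are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def zipf_slope(text):
    """Least-squares slope of log(frequency) against log(rank) for unigrams.

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    which is a simplifying assumption and not taken from the paper.
    """
    counts = Counter(text.lower().split())
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic text whose word frequencies are exactly 60 / rank (made-up data).
words = []
for rank, word in enumerate(["alpha", "beta", "gamma", "delta", "epsilon"], start=1):
    words.extend([word] * (60 // rank))
ordered_text = " ".join(words)

# Shuffling destroys word order but leaves the unigram counts untouched.
shuffled = words[:]
random.Random(0).shuffle(shuffled)
shuffled_text = " ".join(shuffled)

print(zipf_slope(ordered_text))   # close to -1 for an ideal Zipfian text
print(zipf_slope(shuffled_text))  # same value: unigram counts ignore word order
```

Because the slope depends only on word counts, any generator that reproduces those counts obtains the same value regardless of whether its output is coherent, which is consistent with the abstract's case for additional features that are harder to imitate.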