A comparative study of statistical features of language in blogs-vs-splogs

Authors:
Soumya Datta; Sudeshna Sarkar
Affiliations:
Indian Institute of Technology, Kharagpur, India; Indian Institute of Technology, Kharagpur, India
Venue:
Proceedings of the second workshop on Analytics for noisy unstructured text data
Year:
2008

Abstract

Language usage in blogs deviates from the language used in traditional corpora, largely because of noise from causes such as spelling errors, grammatical irregularities, and overuse of abbreviations and symbolic characters like emoticons. Spam blogs, or splogs, are a subset of blogs that are usually written to target a specific audience for marketing promotions and are mostly generated by software that readily imitates the Zipfian distribution of words. It is therefore difficult to separate splogs from non-splogs using only the frequency distribution of unigrams. In this detailed comparative study we present and highlight several additional statistical features of language that are hard to imitate and serve as good discriminators between splogs and blogs.
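
To make the point about unigram frequencies concrete, here is a minimal Python sketch of how a Zipfian profile might be measured. It is an illustration only, not the authors' method: the function names `rank_frequency` and `zipf_slope`, the simple regular-expression tokenizer, and the least-squares fit on log-log axes are all assumptions introduced here.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return unigram counts sorted from most to least frequent."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipfian signature.
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Usage: pass the text of one blog post (hypothetical variable `post_text`).
# slope = zipf_slope(rank_frequency(post_text))
```

A generator that samples words in Zipfian proportions would yield a slope close to -1, as natural text typically does, so this single number would not separate the two classes. This is consistent with the abstract's argument that additional features are needed.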