Language usage in blogs deviates from the language of traditional corpora, largely because of noise from sources such as spelling errors, grammatical irregularities, and the overuse of abbreviations and symbolic characters such as emoticons. Spam blogs, or splogs, are the subset of blogs that are usually written to target a specific audience for marketing promotions and are mostly generated by software that readily imitates the Zipfian distribution of words. It is therefore difficult to separate splogs from non-splogs using only the frequency distribution of unigrams. In this detailed comparative study, we present and highlight several additional statistical features of language that are hard to imitate and serve as good discriminators between splogs and blogs.
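As an illustration of the unigram baseline discussed above, the following minimal sketch estimates how closely a text's word frequencies follow a Zipfian rank-frequency curve. It is not taken from the study: the function names `unigram_rank_frequency` and `zipf_slope`, the simple regular-expression tokenizer, and the least-squares fit in log-log space are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def unigram_rank_frequency(text):
    """Return unigram counts sorted from most to least frequent.

    The tokenizer is an assumption: lowercase runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipfian signature.
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two distinct word types")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least squares: slope = covariance(x, y) / variance(x).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Under these assumptions, machine-generated text whose word counts are tuned to give a slope near -1 would look natural on this measure alone, which is consistent with the abstract's point that unigram frequencies by themselves are a weak discriminator.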