Weblog classification for fast splog filtering: a URL language model segmentation approach

Authors:
Franco Salvetti; Nicolas Nicolov
Affiliations:
Univ. of Colorado at Boulder, Boulder, CO and Umbria, Inc., Boulder, CO; Umbria, Inc., Boulder, CO
Venue:
NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
Year:
2006

Citing (4):
Deriving marketing intelligence from online discussion. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Fast webpage classification using URL features. Proceedings of the 14th ACM international conference on Information and knowledge management
Speech and Language Processing (2nd Edition)
Combating web spam with trustrank. VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Cited by (8):
Detecting splogs via temporal dynamics using self-similarity analysis. ACM Transactions on the Web (TWEB)
Splog Filtering Based on Writing Consistency. WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
A comparison of fraud cues and classification methods for fake escrow website detection. Information Technology and Management
Data-driven compound splitting method for english compounds in domain names. Proceedings of the 18th ACM conference on Information and knowledge management
Detecting spam blogs from blog search results. Information Processing and Management: an International Journal
Web scale NLP: a case study on url word breaking. Proceedings of the 20th international conference on World wide web
Detecting fake websites: the contribution of statistical learning theory. MIS Quarterly
Detecting Fake Medical Web Sites Using Recursive Trust Labeling. ACM Transactions on Information Systems (TOIS)

Abstract

This paper shows that, in statistical weblog classification for splog filtering based on n-grams of URL tokens, segmenting URLs beyond the standard punctuation boundaries is helpful. Many splog URLs contain phrases whose words are glued together to evade splog filters that rely on punctuation segmentation and unigrams. A technique that segments long tokens into the words forming the phrase is proposed and evaluated. The resulting tokens are used as features for a weblog classifier whose accuracy is similar to that of humans (78% vs. 76%) and which reaches 93.3% precision in identifying splogs at a recall of 50.9%.
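
The segmentation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the language model or the search procedure, so the toy unigram counts (TOY_COUNTS), the unknown-word penalty, the example URL, and the function names segment, url_tokens, and ngrams are all assumptions made for illustration. The sketch splits a URL on punctuation, then uses dynamic programming to break each glued token into the word sequence with the highest unigram log-probability, and finally builds token n-grams that could serve as classifier features.

```python
import math
import re

# Toy unigram counts: hypothetical values for illustration, not from the paper.
TOY_COUNTS = {
    "buy": 50, "cheap": 40, "car": 30, "insurance": 25, "online": 45,
    "blog": 60, "spot": 20, "blogspot": 35, "com": 100, "in": 80, "a": 90,
}
TOTAL = sum(TOY_COUNTS.values())


def word_logprob(word):
    """Log-probability of a candidate word under the toy unigram model."""
    if word in TOY_COUNTS:
        return math.log(TOY_COUNTS[word] / TOTAL)
    # Assumed penalty for unknown strings; it grows with length so that
    # known words are preferred whenever they cover the token.
    return math.log(1.0 / TOTAL) - 5.0 * len(word)


def segment(token):
    """Split a glued token into its highest-scoring word sequence."""
    n = len(token)
    # best[i] = (best score for token[:i], start index of the last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(end):
            score = best[start][0] + word_logprob(token[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the token to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(token[start:end])
        end = start
    return words[::-1]


def url_tokens(url):
    """Punctuation split, then language-model segmentation of each piece."""
    url = re.sub(r"^[a-z]+://", "", url.lower())
    pieces = [p for p in re.split(r"[^a-z0-9]+", url) if p]
    return [w for p in pieces for w in segment(p)]


def ngrams(tokens, n):
    """Contiguous n-grams of tokens, usable as classifier features."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    toks = url_tokens("http://buycheapcarinsurance.blogspot.com")
    print(toks)
    print(ngrams(toks, 2))
```

With punctuation splitting alone, the hypothetical URL above yields the single opaque token "buycheapcarinsurance"; after segmentation, the individual words become available as unigram and n-gram features. A realistic system would estimate the language model from a large corpus rather than from a hand-written table.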