Weblog classification for fast splog filtering: a URL language model segmentation approach

  • Authors: Franco Salvetti; Nicolas Nicolov
  • Affiliations: Univ. of Colorado at Boulder, Boulder, CO and Umbria, Inc., Boulder, CO; Umbria, Inc., Boulder, CO
  • Venue: NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
  • Year: 2006

Abstract

This paper shows that, for statistical weblog classification aimed at splog filtering with n-grams of URL tokens as features, segmenting URLs beyond the standard punctuation boundaries is helpful. Many splog URLs contain phrases whose words are glued together to evade filtering techniques based on punctuation segmentation and unigrams. A technique that segments long tokens into the words forming the phrase is proposed and evaluated. The resulting tokens are used as features for a weblog classifier whose accuracy is similar to that of humans (78% vs. 76%) and which identifies splogs with 93.3% precision at 50.9% recall.
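
The abstract does not spell out the segmentation procedure, so the following is only a minimal sketch of one plausible approach: split the URL on punctuation first, then break each glued token into its most probable word sequence under a unigram language model using dynamic programming. The vocabulary counts, the unknown-word penalty, and the names `TOY_COUNTS`, `segment_token`, and `url_tokens` are illustrative assumptions, not taken from the paper.

```python
import math
import re

# Toy unigram counts; purely illustrative, NOT data from the paper.
TOY_COUNTS = {
    "cheap": 50, "car": 40, "insurance": 30, "online": 60,
    "best": 45, "deals": 25, "blog": 80, "free": 70,
}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 20  # assumed upper bound on the length of a single word


def word_logprob(word):
    """Log-probability of a candidate word under the toy unigram model."""
    count = TOY_COUNTS.get(word)
    if count is not None:
        return math.log(count / TOTAL)
    # Assumed penalty for unknown strings: it grows with length, so
    # known words are preferred over long unknown chunks.
    return math.log(1.0 / TOTAL) - 5.0 * len(word)


def segment_token(token):
    """Split a glued token into its highest-scoring word sequence."""
    n = len(token)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for token[:i]
    back = [0] * (n + 1)            # back[i]: start of the last word in token[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            score = best[j] + word_logprob(token[j:i])
            if score > best[i]:
                best[i] = score
                back[i] = j
    # Follow the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(token[back[i]:i])
        i = back[i]
    return words[::-1]


def url_tokens(url):
    """Punctuation segmentation followed by segmentation of each token."""
    coarse = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    fine = []
    for token in coarse:
        fine.extend(segment_token(token))
    return fine
```

With this toy model, `segment_token("cheapcarinsurance")` yields `["cheap", "car", "insurance"]`, whereas punctuation splitting alone would leave the glued phrase as a single, rarely seen token. The resulting words could then be turned into n-gram features for any standard text classifier; the choice of classifier is left open here.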