This paper shows that, for statistical weblog classification aimed at filtering splogs (spam blogs) and based on n-grams of URL tokens, segmenting URLs beyond the standard punctuation boundaries is helpful. Many splog URLs contain phrases whose words are glued together to evade filtering techniques that rely on punctuation segmentation and unigrams. A technique that segments long tokens into the words forming the phrase is proposed and evaluated. The resulting tokens are used as features for a weblog classifier whose accuracy is similar to that of humans (78% vs. 76%) and which reaches 93.3% precision in identifying splogs at a recall of 50.9%.
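The idea of segmenting glued-together URL tokens can be illustrated with a minimal sketch. It is not the paper's method: it assumes a tiny hypothetical vocabulary with made-up counts (`TOY_COUNTS`), an assumed length threshold for "long" tokens (`MIN_LONG_TOKEN`), and a simple unigram dynamic program as the segmenter. The function names and the example URL are also illustrative.

```python
import math
import re

# Hypothetical toy vocabulary with made-up counts (an assumption for
# illustration; the abstract does not specify a word list or model).
TOY_COUNTS = {
    "cheap": 50, "car": 80, "insurance": 40, "quotes": 30,
    "free": 90, "online": 70, "blog": 60, "best": 65,
}
TOTAL = sum(TOY_COUNTS.values())
MIN_LONG_TOKEN = 8  # assumed threshold above which a token is segmented


def punctuation_tokens(url):
    """Baseline: lowercase the URL and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def segment(token):
    """Split a token into vocabulary words with maximum unigram log-probability.

    Returns [token] unchanged if no full segmentation into known words exists.
    """
    n = len(token)
    # best[i] = (best log-probability of token[:i], start index of last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            word = token[start:end]
            if word in TOY_COUNTS and best[start][0] > -math.inf:
                score = best[start][0] + math.log(TOY_COUNTS[word] / TOTAL)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][1] is None:
        return [token]
    # Walk the back-pointers from the end of the token to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(token[start:end])
        end = start
    return words[::-1]


def url_features(url):
    """Unigram and bigram features over punctuation tokens, with long tokens segmented."""
    words = []
    for tok in punctuation_tokens(url):
        words.extend(segment(tok) if len(tok) >= MIN_LONG_TOKEN else [tok])
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


if __name__ == "__main__":
    print(url_features("http://cheapcarinsurancequotes.blogspot.com/"))
```

With punctuation splitting alone, `cheapcarinsurancequotes` would remain a single opaque token. After segmentation it contributes word-level unigrams and bigrams such as `car insurance`, which an n-gram classifier could then use as features. Tokens that cannot be fully covered by the vocabulary, such as `blogspot` here, are left intact.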