Fake content is flourishing on the Internet, ranging from basic random word salads to web scraping. Most of this fake content is generated to feed fake web sites aimed at biasing search engine indexes: at the scale of a search engine, automatically generated texts make such sites harder to detect than copies of existing pages. In this paper, we present three methods for distinguishing natural texts from artificially generated ones: the first uses basic lexicometric features, the second uses standard language models, and the third is based on a relative entropy measure that captures short-range dependencies between words. Our experiments show that lexicometric features and language models detect most generated texts effectively, but fail on texts generated with high-order Markov models. By comparison, our relative entropy scoring algorithm, especially when trained on a large corpus, detects these "hard" text generators with a high degree of accuracy.
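To illustrate the language-model approach mentioned above, the sketch below scores a text by its perplexity under an add-alpha-smoothed bigram model trained on natural text; text whose word sequences the model finds improbable (high perplexity) is flagged as likely generated. This is a minimal illustration under simplifying assumptions (bigrams only, add-alpha smoothing), not the authors' implementation, which relies on standard smoothed n-gram models and a separate relative entropy score.

```python
from collections import Counter
import math

def train_bigram_counts(tokens):
    """Collect unigram and bigram counts from a natural-text training corpus."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(tokens, unigrams, bigrams, vocab_size, alpha=1.0):
    """Per-word perplexity under an add-alpha-smoothed bigram model.

    Higher values mean the word sequence is less probable under the
    model trained on natural text, which is the signal used to flag
    artificially generated content.
    """
    log_prob = 0.0
    n = 0
    for prev, word in zip(tokens, tokens[1:]):
        num = bigrams[(prev, word)] + alpha
        den = unigrams[prev] + alpha * vocab_size
        log_prob += math.log(num / den)
        n += 1
    return math.exp(-log_prob / n)

# Toy usage: a text that reuses the training corpus's word order scores
# a lower perplexity than the same words in a scrambled order.
corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams, bigrams = train_bigram_counts(corpus)
vocab = len(unigrams)
natural = perplexity("the cat sat on the mat".split(), unigrams, bigrams, vocab)
scrambled = perplexity("mat the on cat sat rug".split(), unigrams, bigrams, vocab)
```

In practice a detector would train on a large reference corpus, compute a perplexity threshold on held-out natural text, and flag documents above it; as the abstract notes, this baseline fails against generators that themselves use high-order Markov models, which is what motivates the relative entropy score.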