Fake content is flourishing on the Internet, ranging from basic random word salads to web scraping. Most of this fake content is generated to feed fake web sites aimed at biasing search engine indexes: at the scale of a search engine, automatically generated texts make such sites harder to detect than copies of existing pages. In this paper, we present three methods for distinguishing natural texts from artificially generated ones: the first uses basic lexicometric features, the second uses standard language models, and the third is based on a relative entropy measure that captures short-range dependencies between words. Our experiments show that lexicometric features and language models detect most generated texts effectively, but fail on texts generated with high-order Markov models. By comparison, our relative entropy scoring algorithm, especially when trained on a large corpus, detects these "hard" text generators with a high degree of accuracy.
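To illustrate the language-model approach mentioned above, the sketch below scores a text by its perplexity under an add-alpha-smoothed bigram model trained on natural text; text whose word sequences the model finds improbable (high perplexity) is flagged as likely generated. This is a minimal illustration under simplifying assumptions (bigrams only, add-alpha smoothing), not the authors' implementation, which relies on standard smoothed n-gram models and a separate relative entropy score.

```python
from collections import Counter
import math

def train_bigram_counts(tokens):
    """Collect unigram and bigram counts from a natural-text training corpus."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(tokens, unigrams, bigrams, vocab_size, alpha=1.0):
    """Per-word perplexity under an add-alpha-smoothed bigram model.

    Higher values mean the word sequence is less probable under the
    model trained on natural text, which is the signal used to flag
    artificially generated content.
    """
    log_prob = 0.0
    n = 0
    for prev, word in zip(tokens, tokens[1:]):
        num = bigrams[(prev, word)] + alpha
        den = unigrams[prev] + alpha * vocab_size
        log_prob += math.log(num / den)
        n += 1
    return math.exp(-log_prob / n)

# Toy usage: a text that reuses the training corpus's word order scores
# a lower perplexity than the same words in a scrambled order.
corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams, bigrams = train_bigram_counts(corpus)
vocab = len(unigrams)
natural = perplexity("the cat sat on the mat".split(), unigrams, bigrams, vocab)
scrambled = perplexity("mat the on cat sat rug".split(), unigrams, bigrams, vocab)
```

In practice a detector would train on a large reference corpus, compute a perplexity threshold on held-out natural text, and flag documents above it; as the abstract notes, this baseline fails against generators that themselves use high-order Markov models, which is what motivates the relative entropy score.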