Detection of text quality flaws as a one-class classification problem

Authors:
Maik Anderka; Benno Stein; Nedim Lipka
Affiliations:
Bauhaus-Universität Weimar, Weimar, Germany; Bauhaus-Universität Weimar, Weimar, Germany; Bauhaus-Universität Weimar, Weimar, Germany
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

For Web applications based on user-generated content, the detection of text quality flaws is a key concern. Our research contributes to automatic quality flaw detection. In particular, we propose to cast the detection of text quality flaws as a one-class classification problem: we are given only positive examples (i.e., texts containing a particular quality flaw) and must decide whether or not an unseen text suffers from this flaw. We argue that common binary or multiclass classification approaches are ineffective here, and we underpin our approach with a real-world application: we employ a dedicated one-class learning approach to determine whether a given Wikipedia article suffers from certain quality flaws. Since acquiring sensible test data in the Wikipedia setting is quite intricate, we analyze the effects of a biased sample selection. In addition, we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. Altogether, provided test data with little noise, four out of ten important quality flaws in Wikipedia can be detected with a precision close to 1.
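
The one-class setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the "dedicated one-class learning approach", so a simple centroid-and-radius model stands in for it here. The feature vectors, the function names, and the `coverage` parameter are all hypothetical. The second part shows one common way to relate precision to an assumed flaw ratio via Bayes' rule, given a true-positive rate and a false-positive rate measured on test data; whether the paper uses exactly this computation is an assumption.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Model = Tuple[List[float], float]  # (centroid, radius)


def fit_one_class(positives: List[Vector], coverage: float = 0.9) -> Model:
    """Learn a centroid and a radius from flawed examples only.

    Stand-in for a real one-class learner: no negative examples are used.
    """
    dims = len(positives[0])
    centroid = [sum(x[d] for x in positives) / len(positives) for d in range(dims)]
    distances = sorted(math.dist(x, centroid) for x in positives)
    # Pick the radius that encloses roughly `coverage` of the training examples.
    index = min(len(distances) - 1, max(0, math.ceil(coverage * len(distances)) - 1))
    return centroid, distances[index]


def has_flaw(model: Model, x: Vector) -> bool:
    """Flag an unseen text as flawed if it lies inside the learned region."""
    centroid, radius = model
    return math.dist(x, centroid) <= radius


def precision_at_flaw_ratio(tpr: float, fpr: float, flaw_ratio: float) -> float:
    """Expected precision when a fraction `flaw_ratio` of all texts is flawed."""
    flagged_flawed = tpr * flaw_ratio          # flawed texts correctly flagged
    flagged_clean = fpr * (1.0 - flaw_ratio)   # clean texts wrongly flagged
    total = flagged_flawed + flagged_clean
    return flagged_flawed / total if total > 0 else 0.0


# Hypothetical two-dimensional feature vectors of texts tagged with one flaw.
flawed_texts = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0)]
model = fit_one_class(flawed_texts, coverage=1.0)

# Precision of a fixed classifier drops as the flaw becomes rarer.
for ratio in (0.5, 0.1, 0.01):
    print(ratio, round(precision_at_flaw_ratio(0.8, 0.1, ratio), 3))
```

The loop at the end makes the class-imbalance point concrete: with the true-positive and false-positive rates held fixed, the expected precision depends strongly on how frequent the flaw is assumed to be, which is why effectiveness is reported as a function of the flaw distribution rather than as a single number.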