Predicting quality flaws in user-generated content: the case of Wikipedia

Authors:
Maik Anderka;Benno Stein;Nedim Lipka
Affiliations:
Bauhaus-Universität Weimar, Weimar, Germany;Bauhaus-Universität Weimar, Weimar, Germany;Bauhaus-Universität Weimar, Weimar, Germany
Venue:
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Year:
2012



Abstract

The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content treats the task as classifying content as either high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws and thereby indicates the specific respects in which low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to tag content that has some shortcomings. We apply this approach to the English Wikipedia, which is the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since acquiring significant test data is difficult in the Wikipedia setting, we analyze the effects of a biased sample selection. In this regard we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated with 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: given test data with little noise, four flaws can be detected with a precision close to 1.
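
The abstract's remark that classifier effectiveness depends on the flaw distribution can be illustrated with a small calculation. The following is a minimal sketch, not the authors' evaluation procedure: it applies the standard identity that relates precision to a detector's true-positive rate, its false-positive rate, and the fraction of articles that actually have the flaw. The function name `precision_at_prior` and the example rates are assumptions made purely for illustration.

```python
def precision_at_prior(tpr: float, fpr: float, flaw_prior: float) -> float:
    """Expected precision of a flaw detector for a given flaw prevalence.

    tpr        -- fraction of flawed articles the detector flags
    fpr        -- fraction of flawless articles the detector flags
    flaw_prior -- fraction of all articles that truly have the flaw
    (All three values are hypothetical inputs, not figures from the paper.)
    """
    for name, value in (("tpr", tpr), ("fpr", fpr), ("flaw_prior", flaw_prior)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    true_positives = tpr * flaw_prior            # flagged and actually flawed
    false_positives = fpr * (1.0 - flaw_prior)   # flagged but flawless
    flagged = true_positives + false_positives
    if flagged == 0.0:
        return float("nan")                      # nothing is flagged at all
    return true_positives / flagged


if __name__ == "__main__":
    # The same detector looks very different on balanced and imbalanced data.
    for prior in (0.5, 0.1, 0.01):
        p = precision_at_prior(tpr=0.9, fpr=0.1, flaw_prior=prior)
        print(f"flaw prior {prior:>5}: precision {p:.3f}")
```

With the detector held fixed, precision falls as the flaw becomes rarer, which is why results measured on a balanced test sample may not carry over to a collection whose true flaw ratio is unknown.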