What makes a good biography?: multidimensional quality analysis based on Wikipedia article feedback data

Authors:
Lucie Flekova;Oliver Ferschke;Iryna Gurevych
Affiliations:
Ubiquitous Knowledge Processing Lab (UKP-DIPF) German Institute for Educational Research and Educational Information, Frankfurt am Main, Germany;Ubiquitous Knowledge Processing Lab (UKP-TUDA) Department of Computer Science, Technische Universität Darmstadt, Darmstadt, Germany;Ubiquitous Knowledge Processing Lab (UKP-TUDA) Department of Computer Science, Technische Universität Darmstadt, Darmstadt, Germany
Venue:
Proceedings of the 23rd International Conference on World Wide Web
Year:
2014

Abstract

With more than 22 million articles, Wikipedia, the largest collaborative knowledge resource, never sleeps, experiencing several article edits every second. Over one fifth of these articles describe individual people, the majority of whom are still alive. Such articles are, by their nature, prone to corruption and vandalism. Manual quality assurance by experts can barely cope with this massive amount of data. Can it be effectively replaced by feedback from the crowd? Can we provide meaningful support for quality assurance with automated text processing techniques? Which properties of the articles should then play a key role in the machine learning algorithms, and why? In this paper, we study the user-perceived quality of Wikipedia articles based on a novel Wikipedia user feedback dataset. In contrast to previous work on quality assessment, which mostly relied on judgements of active Wikipedia authors, we analyze ratings of ordinary Wikipedia users along four quality dimensions (Complete, Well written, Trustworthy and Objective). We first present an empirical analysis of the novel dataset with over 36 million Wikipedia article ratings. We then select a subset of biographical articles and perform classification experiments to predict their quality ratings along each of the dimensions, exploring multiple linguistic, surface and network properties of the rated articles. Additionally, we study the classification performance and differences for the biographies of living and dead people as well as those for men and women. We demonstrate the effectiveness of our approach with F-scores of 0.94, 0.89, 0.73, and 0.73 for the dimensions Complete, Well written, Trustworthy, and Objective, respectively. Based on the results, we believe that the quality assessment of big textual data can be effectively supported by current text classification and language processing tools.