User edits classification using document revision histories

Authors:
Amit Bronner;Christof Monz
Affiliations:
Informatics Institute, University of Amsterdam;Informatics Institute, University of Amsterdam
Venue:
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2012

Citing 17
Cited 2

Support-Vector Networks

Machine Learning
Random Forests

Machine Learning
Text Categorization with Support Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
Edit Distance with Move Operations

CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Studying cooperation and conflict between authors with history flow visualizations

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
He says, she says: conflict and coordination in Wikipedia

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Mining Wikipedia revision histories for improving sentence compression

HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Choosing the right translation: a syntactically informed classification approach

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Predicting the fluency of text with shallow structural features: case studies of machine translation and human-written text

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
The WEKA data mining software: an update

ACM SIGKDD Explorations Newsletter
Paraphrase recognition using machine learning to combine similarity measures

ACLstudent '09 Proceedings of the ACL-IJCNLP 2009 Student Research Workshop
Detecting Wikipedia vandalism via spatio-temporal analysis of revision metadata

Proceedings of the Third European Workshop on System Security
For the sake of simplicity: unsupervised extraction of lexical simplifications from Wikipedia

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Using the past to score the present: extending term weighting models through revision history analysis

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A survey of paraphrasing and textual entailment methods

Journal of Artificial Intelligence Research
Learning to simplify sentences with quasi-synchronous grammar and integer programming

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Term weighting based on document revision history

Journal of the American Society for Information Science and Technology

CoSyne: synchronizing multilingual wiki content

Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration
A new data hiding method via revision history records on collaborative writing platforms

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)

Abstract

Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data.
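
To make the idea of an edit-distance baseline concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the baseline is computed, so the character-level Levenshtein distance, the threshold value of 4, and the function names `levenshtein` and `classify_edit` are all hypothetical choices. The intuition it encodes is that small surface changes tend to be fluency edits, while larger rewrites are more likely to change facts.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance with unit-cost insert, delete, substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def classify_edit(before: str, after: str, threshold: int = 4) -> str:
    """Label an edit segment as 'fluency' or 'factual'.

    Assumption: the threshold is a hypothetical value chosen for
    illustration; in practice it would be tuned on labeled edits.
    """
    distance = levenshtein(before, after)
    return "fluency" if distance <= threshold else "factual"


if __name__ == "__main__":
    print(classify_edit("recieve", "receive"))
    print(classify_edit("born in 1950", "born in Amsterdam in 1962"))
```

A single threshold on surface distance cannot tell a one-digit date correction (a factual edit) from a spelling fix, which is presumably why the approach described above adds language model, part-of-speech, and named-entity features on top of string similarity.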