Automatic vandalism detection in Wikipedia with active associative classification

Authors:
Maria Sumbana;Marcos André Gonçalves;Rodrigo Silva;Jussara Almeida;Adriano Veloso
Affiliations:
Department of Computer Science, Universidade Federal de Minas Gerais, Brazil;Department of Computer Science, Universidade Federal de Minas Gerais, Brazil;Department of Computer Science, Universidade Federal de Minas Gerais, Brazil;Department of Computer Science, Universidade Federal de Minas Gerais, Brazil;Department of Computer Science, Universidade Federal de Minas Gerais, Brazil
Venue:
TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries
Year:
2012

Abstract

Wikipedia and other free editing services for collaboratively generated content have quickly grown in popularity. However, the lack of editing control has made these services vulnerable to various types of malicious actions, such as vandalism. State-of-the-art vandalism detection methods are based on supervised techniques and thus rely on the availability of large and representative training collections. Building such collections, often with the help of crowdsourcing, is very costly because vandalism examples are naturally rare in the available data and vandalism patterns change over time. Aiming to reduce the cost of building such collections, we present a new active sampling technique coupled with an on-demand associative classification algorithm for Wikipedia vandalism detection. We show that our classifier, enhanced with a simple undersampling technique for building the training set, outperforms state-of-the-art classifiers such as SVMs and kNN. Furthermore, by applying active sampling, we are able to reduce the amount of training data needed by almost 96% with only a small impact on detection results.
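
The abstract does not say how the undersampling is carried out. As a rough illustration only, the sketch below assumes plain random undersampling: every vandalism example is kept and regular edits are randomly discarded until a chosen ratio is reached. The function name `undersample` and the parameters `positive`, `ratio`, and `seed` are hypothetical and are not taken from the paper.

```python
import random


def undersample(examples, labels, positive=1, ratio=1.0, seed=0):
    """Random undersampling sketch (an assumption, not the paper's method).

    Keeps every minority-class (vandalism) example and a random subset of
    the majority class, so that roughly `ratio` majority examples remain
    per minority example.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == positive]
    neg = [i for i, y in enumerate(labels) if y != positive]
    # Never ask for more majority examples than actually exist.
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [examples[i] for i in kept], [labels[i] for i in kept]


# Toy data: 100 edits, of which 5 are labeled as vandalism (label 1).
edits = list(range(100))
labels = [1 if i % 20 == 0 else 0 for i in edits]
balanced_edits, balanced_labels = undersample(edits, labels)
```

With `ratio=1.0` the resulting training set holds as many regular edits as vandalism examples, which is one common way to counter the class skew the abstract describes.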