A Framework to Generate Synthetic Multi-label Datasets

Authors:
Jimena Torres Tomás;Newton Spolaôr;Everton Alvares Cherman;Maria Carolina Monard
Affiliations:
Laboratory of Computational Intelligence, Institute of Mathematics and Computer Science, University of São Paulo, 13560-970 São Carlos, SP, Brazil
Venue:
Electronic Notes in Theoretical Computer Science (ENTCS)
Year:
2014

Abstract

Controlled environments, in which the properties of the dataset given to a learning algorithm are known in advance, are useful for evaluating machine learning algorithms empirically. Synthetic (artificial) datasets serve this purpose. Although frameworks for generating synthetic single-label datasets are publicly available, this is not the case for multi-label datasets, in which each instance is associated with a set of labels that are usually correlated. This work presents Mldatagen, a multi-label dataset generator framework that we have implemented and made publicly available to the community. Two strategies are currently implemented in Mldatagen: hypersphere and hypercube. For each label in the multi-label dataset, these strategies randomly generate a geometric shape (a hypersphere or a hypercube) and populate it with randomly generated points (instances). Each instance is then labeled according to the shapes it belongs to, which defines its multi-label. Experiments with a multi-label classification algorithm on six synthetic datasets illustrate the use of Mldatagen.
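
The hypersphere strategy described in the abstract can be illustrated with a minimal Python sketch. This is not Mldatagen's code or interface: the function names, the parameters (`n_features`, `n_labels`, `n_per_label`, `seed`), and the ranges used for centers and radii are all hypothetical choices made for illustration. The sketch only mirrors the three steps named above: draw one random hypersphere per label, populate each with random points, and label every point by the hyperspheres that contain it.

```python
import math
import random


def random_hypersphere(n_features, rng):
    """Draw a random center and radius (the ranges are illustrative assumptions)."""
    center = [rng.uniform(-0.5, 0.5) for _ in range(n_features)]
    radius = rng.uniform(0.2, 0.5)
    return center, radius


def sample_in_hypersphere(center, radius, rng):
    """Sample one point uniformly from the interior of a hypersphere."""
    # A normalized Gaussian vector gives a uniformly random direction.
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(g * g for g in direction))
    # Scaling the radius by u**(1/d) makes the density uniform over the volume.
    scale = radius * rng.random() ** (1.0 / len(center)) / norm
    return [c + scale * g for c, g in zip(center, direction)]


def generate_hypersphere_dataset(n_features, n_labels, n_per_label, seed=None):
    """Return (X, Y): instances and their binary label vectors."""
    rng = random.Random(seed)
    # Step 1: one random hypersphere per label.
    shapes = [random_hypersphere(n_features, rng) for _ in range(n_labels)]
    # Step 2: populate each hypersphere with randomly generated points.
    X = [
        sample_in_hypersphere(center, radius, rng)
        for center, radius in shapes
        for _ in range(n_per_label)
    ]
    # Step 3: an instance receives label j if it lies inside hypersphere j,
    # so points in overlapping regions receive several labels.
    Y = [
        [int(math.dist(x, center) <= radius) for center, radius in shapes]
        for x in X
    ]
    return X, Y


if __name__ == "__main__":
    X, Y = generate_hypersphere_dataset(n_features=3, n_labels=4, n_per_label=25, seed=0)
    multi = sum(1 for y in Y if sum(y) > 1)
    print(f"{len(X)} instances, {multi} with more than one label")
```

Because the hyperspheres may overlap, label correlation in this sketch arises from the geometry alone: two labels co-occur exactly where their shapes intersect. A hypercube variant would differ only in the shape sampler and in the membership test.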