A semi-supervised regression model for mixed numerical and categorical variables

Authors:
Michael K. Ng; Elaine Y. Chan; Meko M. C. So; Wai-Ki Ching
Affiliations:
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong; Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong; School of Management, The University of Southampton, Highfield, Southampton, SO17 1BJ, UK; Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong
Venue:
Pattern Recognition
Year:
2007

Abstract

In this paper, we develop a semi-supervised regression algorithm for analyzing data sets that contain both categorical and numerical attributes. The algorithm partitions a data set into several clusters and, at the same time, fits a multivariate regression model to each cluster. This framework combines multivariate regression models for the numerical variables (a supervised learning method) with k-modes clustering for the categorical variables (an unsupervised learning method). The regression coefficients and the k-modes parameters are estimated simultaneously by minimizing an objective function: the weighted sum of the least-squares errors of the multivariate regression models and the dissimilarity measures among the categorical variables. Experiments on both synthetic and real data sets demonstrate the effectiveness of the proposed method.
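
The abstract does not give the exact objective or update rules, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes hard cluster assignments, simple matching (Hamming) dissimilarity to a per-cluster mode, a single weight `gamma` between the two terms, and a response observed for every record. The function name `fit_clusterwise` and all parameter names are hypothetical.

```python
import numpy as np

def fit_clusterwise(X, y, C, k, gamma=1.0, n_iter=20, seed=0):
    """Alternate between per-cluster refitting and reassignment.

    X: (n, p) numerical predictors, y: (n,) response,
    C: (n, q) categorical attributes coded as non-negative integers.
    Assumed cost of record i in cluster j:
        (y_i - x_i . beta_j)^2 + gamma * #{mismatches between C_i and mode_j}
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Xb = np.column_stack([np.ones(n), X])      # add an intercept column
    labels = rng.integers(k, size=n)           # random initial partition
    betas = np.zeros((k, Xb.shape[1]))
    modes = np.zeros((k, C.shape[1]), dtype=C.dtype)
    for _ in range(n_iter):
        for j in range(k):
            idx = labels == j
            if not idx.any():
                continue                       # empty cluster: keep old parameters
            # least-squares regression fit within cluster j
            betas[j] = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)[0]
            # k-modes style update: most frequent category per attribute
            modes[j] = [np.bincount(col).argmax() for col in C[idx].T]
        resid = (y[:, None] - Xb @ betas.T) ** 2                      # (n, k)
        mismatch = (C[:, None, :] != modes[None, :, :]).sum(axis=2)   # (n, k)
        cost = resid + gamma * mismatch
        new_labels = cost.argmin(axis=1)       # move each record to its cheapest cluster
        if np.array_equal(new_labels, labels):
            break                              # partition is stable
        labels = new_labels
    total = cost[np.arange(n), labels].sum()
    return labels, betas, modes, total

# Illustrative usage on made-up data
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
C = rng.integers(3, size=(60, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=60)
labels, betas, modes, total = fit_clusterwise(X, y, C, k=2, gamma=0.5)
```

In this sketch a larger `gamma` makes the categorical attributes dominate the partition, while `gamma = 0` reduces it to clusterwise regression on the numerical variables alone. Like k-means-type procedures in general, it may stop at a local minimum that depends on the initial partition.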