Evaluation of BIC and Cross Validation for model selection on sequence segmentations

Authors:
Niina Haiminen;Heikki Mannila
Affiliations:
HIIT, University of Helsinki and Helsinki University of Technology, P.O. Box 68, FI-00014 University of Helsinki, Finland (both authors)
Venue:
International Journal of Data Mining and Bioinformatics
Year:
2010

Abstract

Segmentation is a general data mining technique for summarising and analysing sequential data. It can be applied, for example, to the study of large-scale genomic structures such as isochores. Choosing the number of segments remains a challenging question. We present extensive experimental studies of two model selection techniques, the Bayesian Information Criterion (BIC) and cross-validation (CV). We successfully identify segments with different means or variances, and demonstrate the effect of linear trends and outliers, which occur frequently in real data. Results are given for real DNA sequences with respect to changes in their codon, G+C, and bigram frequencies, and for copy-number variation in CGH data.
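
The abstract does not spell out the segmentation model or the exact form of the criterion, so the following is only a minimal sketch of how BIC can be used to choose the number of segments. It assumes a piecewise-constant Gaussian model with one shared variance, finds the least-squares segmentation for each k by dynamic programming, and counts 2k free parameters (k means, k - 1 boundaries, one variance). The function names and the parameter count are illustrative assumptions, not taken from the paper.

```python
import math


def segment_costs(x):
    """Return cost(i, j): squared error of x[i:j] around its own mean."""
    s, s2 = [0.0], [0.0]
    for v in x:
        s.append(s[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        tot = s[j] - s[i]
        # Clamp at zero to guard against tiny negative rounding errors.
        return max(s2[j] - s2[i] - tot * tot / (j - i), 0.0)

    return cost


def best_segmentations(x, k_max):
    """For each k in 1..k_max, return (squared error, interior boundaries)."""
    n = len(x)
    cost = segment_costs(x)
    inf = float("inf")
    # err[k][j]: minimal squared error of x[:j] split into k segments.
    err = [[inf] * (n + 1) for _ in range(k_max + 1)]
    back = [[0] * (n + 1) for _ in range(k_max + 1)]
    err[0][0] = 0.0
    for k in range(1, k_max + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = err[k - 1][i] + cost(i, j)
                if c < err[k][j]:
                    err[k][j] = c
                    back[k][j] = i
    results = {}
    for k in range(1, k_max + 1):
        bounds, j = [], n
        for kk in range(k, 0, -1):
            j = back[kk][j]  # start index of the kk-th segment
            bounds.append(j)
        results[k] = (err[k][n], sorted(bounds)[1:])  # drop the leading 0
    return results


def bic(rss, n, k):
    """BIC up to an additive constant (assumed form, shared variance)."""
    p = 2 * k  # assumption: k means + (k - 1) boundaries + 1 variance
    return n * math.log(max(rss, 1e-12) / n) + p * math.log(n)


def choose_k(x, k_max):
    """Return the k with the lowest BIC and its interior boundaries."""
    n = len(x)
    res = best_segmentations(x, k_max)
    scores = {k: bic(res[k][0], n, k) for k in res}
    k_best = min(scores, key=scores.get)
    return k_best, res[k_best][1]


# Toy sequence: three blocks of 40 points with means 0, 5, 0 and a small
# deterministic wiggle of +/- 0.1 standing in for noise.
x = [m + (0.1 if i % 2 == 0 else -0.1)
     for i, m in enumerate([0.0] * 40 + [5.0] * 40 + [0.0] * 40)]
k_best, boundaries = choose_k(x, k_max=6)
print(k_best, boundaries)
```

The cubic-time dynamic program is adequate for short sequences only. A cross-validation variant would instead fit each k-segmentation on a subset of the points and score the held-out points, which this sketch does not attempt.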