The detection of change-points in heterogeneous sequences is a statistical challenge with applications across a wide variety of fields. In bioinformatics, a vast amount of methodology exists to identify an ideal set of change-points for detecting copy number variation (CNV). While many efficient algorithms are currently available for finding the best segmentation of CNV data, relatively few approaches consider the important problem of assessing the uncertainty of the change-point locations. Asymptotic and stochastic approaches exist but often require additional model assumptions to speed up the computations, while exact methods generally have quadratic complexity, which may be intractable for large data sets of tens of thousands of points or more. A hidden Markov model, with constraints specifically chosen to correspond to a segment-based change-point model, provides an exact method for obtaining the posterior distribution of change-points with linear complexity. The methods are implemented in the R package postCP, which uses the results of a given change-point detection algorithm to estimate the probability that each observation is a change-point. The results include an application of postCP to a publicly available CNV data set (n=120). Due to its frequentist framework, postCP obtains less conservative confidence intervals than previously published Bayesian methods, and does so with linear rather than quadratic complexity. Simulations showed that postCP provided comparable loss to a Bayesian MCMC method when estimating posterior means, specifically when assessing larger-scale changes, while being more computationally efficient. On another high-resolution CNV data set (n=14,241), the implementation processed the data in less than one second on a mid-range laptop computer.
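To make the constrained hidden Markov model idea concrete, the following is a minimal Python sketch, not the postCP implementation or its API. It assumes Gaussian observations with known segment means and a common standard deviation (for instance, taken from a prior segmentation), a uniform prior over segmentations (unit transition weights), and hidden states that can only stay in the current segment or advance to the next one. The function names `changepoint_posterior`, `gaussian_logpdf`, and `logaddexp` are hypothetical. A forward-backward pass then gives, for each change-point, the posterior probability of every candidate location in time linear in the number of observations for a fixed number of segments.

```python
import math

NEG_INF = float("-inf")


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)), allowing -inf inputs."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def gaussian_logpdf(x, mean, sd):
    """Log-density of a normal distribution (assumed emission model)."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def changepoint_posterior(x, means, sd):
    """Return post[k][i] = P(segment k ends at index i | x) for k = 0..K-2.

    Hidden state = segment index. The path must start in segment 0, end in
    segment K-1, and at each step either stay or move to the next segment.
    Requires len(x) >= len(means).
    """
    n, K = len(x), len(means)
    loge = [[gaussian_logpdf(xi, m, sd) for m in means] for xi in x]

    # Forward pass: fwd[i][k] = log p(x[0..i], state at i is k).
    fwd = [[NEG_INF] * K for _ in range(n)]
    fwd[0][0] = loge[0][0]
    for i in range(1, n):
        for k in range(K):
            prev = fwd[i - 1][k]  # stayed in segment k
            if k > 0:
                prev = logaddexp(prev, fwd[i - 1][k - 1])  # advanced from k-1
            fwd[i][k] = prev + loge[i][k]

    # Backward pass: bwd[i][k] = log p(x[i+1..n-1], path ends in K-1 | state at i is k).
    bwd = [[NEG_INF] * K for _ in range(n)]
    bwd[n - 1][K - 1] = 0.0
    for i in range(n - 2, -1, -1):
        for k in range(K):
            nxt = loge[i + 1][k] + bwd[i + 1][k]
            if k + 1 < K:
                nxt = logaddexp(nxt, loge[i + 1][k + 1] + bwd[i + 1][k + 1])
            bwd[i][k] = nxt

    log_evidence = fwd[n - 1][K - 1]

    # Posterior of a k -> k+1 transition between indices i and i+1.
    post = []
    for k in range(K - 1):
        row = [
            math.exp(fwd[i][k] + loge[i + 1][k + 1] + bwd[i + 1][k + 1] - log_evidence)
            for i in range(n - 1)
        ]
        row.append(0.0)  # the last index cannot be followed by a new segment
        post.append(row)
    return post
```

For example, `changepoint_posterior([0.0, 0.1, -0.1, 0.05, 3.0, 2.9, 3.1, 3.05], [0.0, 3.0], 0.5)` returns a single row of probabilities over the eight positions; the row sums to one because every admissible path crosses each segment boundary exactly once. Working in log space avoids underflow on long sequences, which matters for data sets of the size mentioned above.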