SAICSIT '04 Proceedings of the 2004 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries
We present a method for evaluating the suitability of different string dissimilarity measures and clustering algorithms for EST clustering, one of the main techniques used in transcriptome projects. The method comprises generating simulated ESTs with user-specified parameters, and then evaluating the quality of clusterings produced when different dissimilarity measures and different clustering algorithms are used. We implemented two tools to do this: ESTSim (EST Simulator), which generates simulated EST sequences from mRNAs/cDNAs using user-specified parameters, and ECLEST (Evaluator for CLusterings of ESTs), which computes and evaluates a clustering of a set of input ESTs, where the dissimilarity measure, the clustering algorithm, and the clustering validity index can be specified independently. We demonstrate the method on a sample of 699 cDNAs, generating approximately 16,000 simulated ESTs. We conducted two experiments and derived statistically significant results from this study comparing subword-based dissimilarity measures to alignment-based ones.
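The simulate, cluster, and evaluate pipeline described above can be illustrated with a minimal Python sketch. It is not the ESTSim or ECLEST implementation: every function name and parameter here is hypothetical, the error model is reduced to random substitutions, a k-mer count measure stands in for a subword-based dissimilarity, single-linkage stands in for the clustering algorithm, and the Rand index stands in for the clustering validity index.

```python
import random
from collections import Counter
from itertools import combinations


def simulate_ests(mrna, n, min_len, max_len, error_rate, rng):
    """Draw n random fragments of mrna and apply substitution errors (toy model)."""
    ests = []
    for _ in range(n):
        length = rng.randint(min_len, min(max_len, len(mrna)))
        start = rng.randint(0, len(mrna) - length)
        fragment = [
            rng.choice("ACGT") if rng.random() < error_rate else base
            for base in mrna[start:start + length]
        ]
        ests.append("".join(fragment))
    return ests


def kmer_dissimilarity(a, b, k=6):
    """Assumed subword-based measure: 1 - shared k-mers / k-mers in the shorter sequence."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ca & cb).values())  # multiset intersection
    smaller = min(sum(ca.values()), sum(cb.values()))
    return 1.0 - shared / smaller if smaller else 1.0


def single_linkage(seqs, threshold, dissimilarity):
    """Merge any two sequences whose dissimilarity is at most threshold (union-find)."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if dissimilarity(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


def rand_index(labels_a, labels_b):
    """Fraction of sequence pairs on which two clusterings agree (needs >= 2 items)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Three random "mRNAs"; the true cluster of each EST is its source mRNA.
    mrnas = ["".join(rng.choice("ACGT") for _ in range(300)) for _ in range(3)]
    ests, truth = [], []
    for label, mrna in enumerate(mrnas):
        for est in simulate_ests(mrna, 10, 80, 150, 0.02, rng):
            ests.append(est)
            truth.append(label)
    predicted = single_linkage(ests, 0.6, kmer_dissimilarity)
    print("Rand index:", rand_index(truth, predicted))
```

Because the dissimilarity function is passed to `single_linkage` as an argument and the validity index is computed separately, each of the three components can be swapped without touching the others, which mirrors the independence the abstract attributes to ECLEST. The threshold of 0.6 and word size of 6 are arbitrary illustrative values, not settings taken from the study.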