A Method for Evaluating the Quality of String Dissimilarity Measures and Clustering Algorithms for EST Clustering

Zimmermann, Judith; Hazelhurst, Scott; Liptak, Zsuzsanna

doi:10.1109/BIBE.2004.1317357

We present a method for evaluating the suitability of different string dissimilarity measures and clustering algorithms for EST clustering, one of the main techniques used in transcriptome projects. The method comprises generating simulated ESTs with user-specified parameters, and then evaluating the quality of clusterings produced when different dissimilarity measures and different clustering algorithms are used. We implemented two tools to do this: ESTSim (EST Simulator), which generates simulated EST sequences from mRNAs/cDNAs using user-specified parameters, and ECLEST (Evaluator for CLusterings of ESTs), which computes and evaluates a clustering of a set of input ESTs, where the dissimilarity measure, the clustering algorithm, and the clustering validity index can be specified independently. We demonstrate the method on a sample of 699 cDNAs, generating approximately 16,000 simulated ESTs. We conducted two experiments and derived statistically significant results from this study comparing subword-based dissimilarity measures to alignment-based ones.
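To make the notion of a subword-based dissimilarity measure concrete, the following minimal sketch compares two sequences by their counts of overlapping words of length k, without computing an alignment. It is an illustration only: the function names `word_counts` and `subword_dissimilarity`, the default word length, and the choice of a sum of squared count differences are assumptions made here, and are not taken from ESTSim or ECLEST.

```python
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping subword of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def subword_dissimilarity(a, b, k=3):
    """Sum of squared differences between the k-word counts of a and b.

    Hypothetical illustrative measure: 0 when both sequences have the
    same word composition, larger as the compositions diverge.
    """
    ca, cb = word_counts(a, k), word_counts(b, k)
    # A Counter returns 0 for words it has not seen, so the union of
    # keys covers words that occur in only one of the two sequences.
    return sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys())
```

Because a measure of this kind depends only on word counts, it can be computed in time linear in the sequence lengths, whereas alignment-based measures typically require dynamic programming over both sequences.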