A Method for Evaluating the Quality of String Dissimilarity Measures and Clustering Algorithms for EST Clustering

Liptak, Zsuzsanna
2004

Abstract

We present a method for evaluating the suitability of different string dissimilarity measures and clustering algorithms for EST clustering, one of the main techniques used in transcriptome projects. The method comprises generating simulated ESTs with user-specified parameters, and then evaluating the quality of clusterings produced when different dissimilarity measures and different clustering algorithms are used. We implemented two tools to do this: ESTSim (EST Simulator), which generates simulated EST sequences from mRNAs/cDNAs using user-specified parameters, and ECLEST (Evaluator for CLusterings of ESTs), which computes and evaluates a clustering of a set of input ESTs, where the dissimilarity measure, the clustering algorithm, and the clustering validity index can be specified independently. We demonstrate the method on a sample of 699 cDNAs, generating approximately 16,000 simulated ESTs. We conducted two experiments and derived statistically significant results from this study comparing subword-based dissimilarity measures to alignment-based ones.
ISBN: 9780769521732
Keywords: string similarity and dissimilarity measures; EST clustering; transcriptome; simulated data; benchmarks

Use this identifier to cite or link to this document: http://hdl.handle.net/11562/391096