Strings in Proteomics and Transcriptomics: Algorithmic and Combinatorial Questions in Mass Spectrometry and EST Clustering

Liptak, Zsuzsanna

This thesis treats two problem areas in bioinformatics which can both be benefi- cially formalized as string problems. The first (and larger) part deals with weighted string problems as they arise from biotechnological mass spectrometry applications. In a mass spectrometry experiment—put in a simplified manner—the molecular mass of the sample mo- lecules is determined. The aim is to identify the sample, either with or without additional information from a database, relying on the fact that the different build- ing blocks of proteins (namely, amino acids) and of DNA molecules (nucleotides) have different molecular masses. Viewing proteins and DNA molecules as strings, this leads naturally to the definition of weighted strings as strings over a finite alphabet Σ with an additional weight (or mass) function μ : Σ → R+. We develop some results in weighted string combinatorics, and then present ef- ficient algorithms for three weighted string problems which are motivated by mass spectrometry. First, the mass decomposition problem is the problem of finding all compomers whose mass equals a query mass. Here, a compomer is an integer vector specify- ing the number of occurrences of each character of Σ. A compomer abstracts from the order of characters in a string and identifies instead all strings with the same number of occurrences of each character—since these strings all have the same mass. We also present simulation results and tables detailing the number of com- pomers for different biomolecules, in the ranges appropriate for mass spectrometry applications. Next, we give several efficient algorithms for the submass problem: to test, for a given weighted string s and a query mass M, whether s has a substring with mass M; to find where such a substring occurs; and other variants. We present an algorithm for binary alphabets which runs in time logarithmic in the length of s. Furthermore, we present several algorithms for the problem where multiple masses are sought; these algorithms encode the submasses of s in a polynomial and rely for their efficiency on Fast Fourier Transform for polynomial multiplication. The third weighted string problem we discuss is de novo peptide sequencing, i.e., recovering the amino acid sequence of a sample peptide from tandem MS data, a specialized mass spectrometry experiment. We describe an algorithm which is an enhancement of a dynamic programming algorithm presented by Chen et al. in 2000. We describe our implementation and present some simulation results. The second part of the thesis deals with EST clustering. ESTs (expressed se- quence tags) are short DNA sequences which are partial copies of mRNAs; they are produced in a high-throughput manner and submitted to public databases. In EST clustering, the aim is to produce a partition of a large set of ESTs, where each cluster corresponds to a gene; thus enabling the researcher to identify which genes were being expressed. Clearly, the quality of the clustering is highly dependent on the dissimilarity measure (distance/similarity measure) for strings and on the clustering algorithm employed. We have developed a method for evaluating differ- ent string dissimilarity measures and clustering algorithms. Finally, we present simulation results for five dissimilarity measures and one clustering algorithm.