Motivations. A central problem in biological network analysis is the Local Network Alignment. The aim is to detect conserved subnetworks or complexes of proteins, across two or more species, which are involved in processes or functions. This allows to predict either new interactions or the functions of unknown proteins. Since the problem of finding conserved subnetworks in a set of networks is related to subgraph isomorphism, which is known to be NP-Hard, several heuristics have been proposed. These include PathBlast [1] and MaWISh [2] for pairwise local alignment and NetworkBlast-M [3] and Graemlin [4] for the multiple case. Although NetworkBlast-M has been proved to be the most efficient and accurate method, it is able to find significant conserved complexes composed by no more than 15 proteins. Here we introduce GASOLINE (Greedy And Stochastic algorithm for Optimal Local alignment of Interaction Networks) an algorithm based on Gibbs Sampling [5] in connection to a seed-extend approach to search for significant conserved complexes of any size. Methods.The algorithm consists of two main phases. In the first phase, we look for ortholog proteins across the networks. We call these proteins seeds of the suboptimal pattern. In the second phase, called iterative phase, we extend each seed, by adding one adjacent node. Here, through a stochastic process based on Gibbs Sampling, we choose a node among a set of randomly picked adjacent ones. The chosen nodes will be those that maximize similarity among the N extended seeds. We repeat the iterative phase until we obtain a set of N conserved subgraphs each of size W. These N subgraphs represent our final alignment. The topological density and the conservation of a complex are measured through an Index of Density and Structural Conservation (hereafter IDSC). The IDSC score ranges from 0 to 1 and it is dynamically computed during the iterative phase. The iteration will terminate when IDSC is above a fixed threshold. This allows the removal of parameter W producing a set of highly dense and conserved complexes of different sizes. Results.GASOLINE has been tested on 10 microbial PPI networks, taken from Graemlin [4] and on 6 eukaryotic PPI networks, taken from STRING database [6]. The number of proteins in microbial networks ranges from 1,000 to 7,000 with the amount of interactions ranging from 13,000 to 230,000, whereas the size of eukaryotic networks ranges from 6,000 to 12,500 proteins and from 26,000 to 166,000 edges. 2000 of execution of GASOLINE have been performed. As output we consider the best distinct (not overlapping) complexes, with respect to size andIDSC score. In our experiments we selected complexes of at least 5 proteins with IDSC score > 0.7. All tests have been performed on an Intel Core i5-2500 3.30Ghz CPU with 4 GB RAM. The complexes computed by the algorithm have been then validated by annotating their proteins with GO categories. For the microbial networks, annotations have been downloaded from DAVID [7,8], while eukaryotic networks proteins have been annotated using BioDBNet [9]. Significant conserved categories have been obtained by computing a p-value (< 0.0001), based on hypergeometric distribution. A GO category has been considered conserved when it resulted significant in at least N-1 species, where N is the number of aligned networks. The executions of GASOLINE on microbial networks revealed the existence of a big conserved complex of 40 proteins, forming the large and small subunits of ribosome, with IDSC equals to 0.755 (see Tab. 1). As for the 6 eukaryotic networks, 15 conserved complexes have been found by GASOLINE with IDSC greater than 0.75. They are listed in Tab. 2, with their IDSC and the number of GO categories enriched.
A greedy and stochastic algorithm for multiple local alignment of interaction networks
GIUGNO, ROSALBA;
2012-01-01
Abstract
Motivations. A central problem in biological network analysis is the Local Network Alignment. The aim is to detect conserved subnetworks or complexes of proteins, across two or more species, which are involved in processes or functions. This allows to predict either new interactions or the functions of unknown proteins. Since the problem of finding conserved subnetworks in a set of networks is related to subgraph isomorphism, which is known to be NP-Hard, several heuristics have been proposed. These include PathBlast [1] and MaWISh [2] for pairwise local alignment and NetworkBlast-M [3] and Graemlin [4] for the multiple case. Although NetworkBlast-M has been proved to be the most efficient and accurate method, it is able to find significant conserved complexes composed by no more than 15 proteins. Here we introduce GASOLINE (Greedy And Stochastic algorithm for Optimal Local alignment of Interaction Networks) an algorithm based on Gibbs Sampling [5] in connection to a seed-extend approach to search for significant conserved complexes of any size. Methods.The algorithm consists of two main phases. In the first phase, we look for ortholog proteins across the networks. We call these proteins seeds of the suboptimal pattern. In the second phase, called iterative phase, we extend each seed, by adding one adjacent node. Here, through a stochastic process based on Gibbs Sampling, we choose a node among a set of randomly picked adjacent ones. The chosen nodes will be those that maximize similarity among the N extended seeds. We repeat the iterative phase until we obtain a set of N conserved subgraphs each of size W. These N subgraphs represent our final alignment. The topological density and the conservation of a complex are measured through an Index of Density and Structural Conservation (hereafter IDSC). The IDSC score ranges from 0 to 1 and it is dynamically computed during the iterative phase. The iteration will terminate when IDSC is above a fixed threshold. This allows the removal of parameter W producing a set of highly dense and conserved complexes of different sizes. Results.GASOLINE has been tested on 10 microbial PPI networks, taken from Graemlin [4] and on 6 eukaryotic PPI networks, taken from STRING database [6]. The number of proteins in microbial networks ranges from 1,000 to 7,000 with the amount of interactions ranging from 13,000 to 230,000, whereas the size of eukaryotic networks ranges from 6,000 to 12,500 proteins and from 26,000 to 166,000 edges. 2000 of execution of GASOLINE have been performed. As output we consider the best distinct (not overlapping) complexes, with respect to size andIDSC score. In our experiments we selected complexes of at least 5 proteins with IDSC score > 0.7. All tests have been performed on an Intel Core i5-2500 3.30Ghz CPU with 4 GB RAM. The complexes computed by the algorithm have been then validated by annotating their proteins with GO categories. For the microbial networks, annotations have been downloaded from DAVID [7,8], while eukaryotic networks proteins have been annotated using BioDBNet [9]. Significant conserved categories have been obtained by computing a p-value (< 0.0001), based on hypergeometric distribution. A GO category has been considered conserved when it resulted significant in at least N-1 species, where N is the number of aligned networks. The executions of GASOLINE on microbial networks revealed the existence of a big conserved complex of 40 proteins, forming the large and small subunits of ribosome, with IDSC equals to 0.755 (see Tab. 1). As for the 6 eukaryotic networks, 15 conserved complexes have been found by GASOLINE with IDSC greater than 0.75. They are listed in Tab. 2, with their IDSC and the number of GO categories enriched.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.