Fast analytical methods for finding significant labeled graph motifs

Giovanni, Micale; Giugno, Rosalba; Alfredo, Ferro; Misael, Mongiovì; Dennis, Shasha; Alfredo, Pulvirenti

doi:10.1007/s10618-017-0544-8

Network motif discovery is the problem of finding subgraphs of a network that occur more frequently than expected, according to some reasonable null hypothesis. Such subgraphs may indicate small scale interaction features in genomic interaction networks or intriguing relationships involving actors or a relationship among airlines.When nodes are labeled, they can carry information such as the genomic entity under study or the dominant genre of an actor. For that reason, labeled subgraphs convey information beyond structure and could therefore enjoy more applications. To identify statistically significant motifs in a given network, we propose an analytical method (i.e. simulation-free) that extends the works of Picard et al. (J Comput Biol 15(1):1–20, 2008) and Schbath et al. (J Bioinform Syst Biol 2009(1):616234, 2009) to label-dependent scale-free graph models. We provide an analytical expression of the mean and variance of the count under the Expected Degree Distribution random graph model. Our model deals with both induced and non-induced motifs. We have tested our methodology on a wide set of graphs ranging from protein–protein interaction networks to movie networks. The analytical model is a fast (usually faster by orders of magnitude) alternative to simulation. This advantage increases as graphs grow in size.