Genome classification by dictionary-based indexes

Castellini, A.; Manca, V.; Bicego, M.; Compri, S.; Marino, V.; Tosadori, G.

A major application of bioinformatics is the analysis of the complete genomes of organisms, which have been sequenced since the late 1990s. Several genome-analysis techniques are based on sequence alignment, structure prediction, phylogenomics, gene prediction, and other biology-driven approaches [2]. In [1], a new approach to genome analysis and comparative genomics was proposed that is rooted in text analysis and information theory. Its aim is to provide sets of linguistic/informational indexes that characterize genome properties relevant in specific biological contexts. Here we determine a set of indexes capable of discriminating genomes in almost full accordance with the domains to which their organisms belong.
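
To make the notion of a dictionary-based index concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes that a genome is treated as a string over the alphabet {A, C, G, T} and that its "dictionary" for a word length k is the set of distinct k-mers occurring in it. The function names (`kmer_counts`, `dictionary_indexes`) and the particular indexes computed (dictionary size, number of words occurring exactly once, empirical k-mer entropy, fraction of possible k-mers observed) are assumptions chosen for illustration; the indexes actually used in the paper may differ.

```python
from collections import Counter
from math import log2


def kmer_counts(genome: str, k: int) -> Counter:
    """Count every overlapping substring of length k (k-mer) in the sequence."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def dictionary_indexes(genome: str, k: int) -> dict:
    """Compute a few simple dictionary-based indexes for word length k.

    These are hypothetical example indexes, not necessarily those of the paper.
    """
    counts = kmer_counts(genome, k)
    total = sum(counts.values())          # number of k-mer occurrences
    dictionary_size = len(counts)         # number of distinct k-mers
    # Words that occur exactly once in the sequence.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    # Shannon entropy (in bits) of the empirical k-mer distribution.
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    # Fraction of the 4**k possible k-mers that actually occur.
    coverage = dictionary_size / 4 ** k
    return {
        "dictionary_size": dictionary_size,
        "hapaxes": hapaxes,
        "entropy": entropy,
        "coverage": coverage,
    }


if __name__ == "__main__":
    # Toy sequence for demonstration only; real genomes are far longer.
    print(dictionary_indexes("ACGTACGT", 2))
```

Under these assumptions, each genome is mapped to a small vector of numbers for one or more values of k, and such vectors could then be passed to any standard classifier or clustering method to compare genomes.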