Structural annotation of eukaryotic genomes in 2nd generation sequencing era

Dal Molin, Alessandra

Over the last decade, the increasing efficiency and falling cost of new sequencing techniques have led to a growing accumulation of genomic sequences in public databases. With this huge volume of sequences generated by high-throughput sequencing projects, the need for accurate and detailed genome annotations has never been greater. Structural genome annotation is the process of identifying structural features in a DNA sequence and classifying them according to their biological role. Computational approaches are increasingly used to perform structural annotation automatically: their generally short run times meet the high-throughput demands of genome sequencing projects, even though they are less accurate than manual curation, which remains the 'gold standard' for evaluating the reliability of an annotation.

The aim of this project is to produce fast and accurate genome annotations, in line with current needs, by applying available computational methods to different experimental cases, depending on the existing biological knowledge and on the nature and quality of the starting data. For completeness, the contribution of each method used to produce the final annotation was analyzed, together with an evaluation of the quality of the results.

The results confirmed that the complexity of eukaryotic genomes greatly affects the annotation process. A large fraction of the genes in a genome can be annotated mainly by homology to known genes or proteins from evolutionarily close species, or through ab initio predictors and species-specific experimental evidence. Integrating multiple sources of evidence greatly improved the accuracy of the final annotations, although they are still not error-free. Quality assessment of the results and filtering of low-confidence sequences, together with manual curation, are still required to achieve the best accuracy.