GRAFIMO: variant and haplotype aware motif scanning on pangenome graphs

Tognon Manuel; Bonnici Vincenzo; Giugno Rosalba
2021

Abstract

Transcription factors (TFs) are key regulatory proteins that promote or reduce the expression of genes by binding short (7-20 bp) DNA sequences known as transcription factor binding sites (TFBS). TFBS can be summarized using Position Weight Matrices (PWMs), which encode the probability of observing a given nucleotide at a given position of a binding site. Recent studies have shown that mutations occurring in regulatory motifs can enhance, weaken or even create binding sites. Moreover, mutations altering TFBS can occur in haplotypes conserved within a population or private to a single individual. While several tools have been developed to scan for potential motif occurrences on reference genome sequences, no tool exists to find them in pangenome variation graphs (VGs). VGs are sequence-labelled graphs that encode collections of genomes and their genetic variants in a single compact data structure. Because VGs can losslessly compress large genomes from large panels of individuals, TFBS scanning in VGs can efficiently capture how genomic variation affects the potential binding landscape of TFs in a population. Here we present GRAFIMO (GRAph-based Finding of Individual Motif Occurrences), a command-line tool for scanning known TF DNA motifs, represented as PWMs, in VGs. GRAFIMO extends the standard PWM scanning procedure by offering a variant- and haplotype-aware search for TFBS in a VG.
Computational biology
Bioinformatics
Motif scanning
Transcription Factor
Genome graphs
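
The idea in the abstract, scoring candidate sites with a PWM while following variant branches in a graph, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the 4-position motif, the uniform background, the toy graph with one SNP, and the function names are hypothetical, and the code enumerates every path through the graph rather than restricting the search to observed haplotypes. It is not GRAFIMO's implementation.

```python
import math

# Hypothetical 4-position motif: one probability column per site position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background nucleotide frequency

# Toy variation graph: node id -> (sequence label, successor node ids).
# Nodes 2 and 3 are the two alleles of a single SNP.
GRAPH = {
    1: ("TTA", [2, 3]),
    2: ("C", [4]),   # reference allele
    3: ("G", [4]),   # alternative allele
    4: ("GTAA", []),
}


def log_odds(kmer):
    """Sum of per-position log2(probability / background) for one k-mer."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, kmer))


def kmers_from(node, offset, k, prefix=""):
    """Yield every k-mer spelled by a graph path starting at (node, offset)."""
    seq, successors = GRAPH[node]
    taken = prefix + seq[offset:offset + k - len(prefix)]
    if len(taken) == k:
        yield taken
    else:
        # The node ended before k bases were collected: branch into successors.
        for nxt in successors:
            yield from kmers_from(nxt, 0, k, taken)


def scan(k=len(PWM)):
    """Score each distinct k-mer found on any path, best score first."""
    hits = set()
    for node, (seq, _) in GRAPH.items():
        for offset in range(len(seq)):
            for kmer in kmers_from(node, offset, k):
                hits.add((kmer, round(log_odds(kmer), 3)))
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    for kmer, score in scan():
        print(kmer, score)
```

In this toy graph the reference allele yields the best-scoring site, while the alternative allele yields a weaker candidate at the same location, which is the kind of variant-dependent difference a graph-aware scan is meant to expose.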

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11562/1049089