Benchmarking PWM and SVM-based Models for Transcription Factor Binding Site Prediction: A Comparative Analysis on Synthetic and Biological Data
Manuel Tognon;Alisa Kumbara;Andrea Betti;Lorenzo Ruggeri;Rosalba Giugno
2025-01-01
Abstract
Transcription factors (TFs) play a crucial role in gene regulation by binding to short, specific DNA sequences known as transcription factor binding sites (TFBSs). Accurately identifying these sites is essential for deciphering gene regulatory mechanisms. Several computational approaches, such as position weight matrices (PWMs) and support vector machines (SVMs), have been proposed to model TFBSs. PWMs provide a probabilistic representation of nucleotide frequencies at each position. While simple and interpretable, they assume positional independence, which limits their ability to capture complex sequence dependencies. To address this limitation, SVMs offer an alternative that leverages k-mer frequencies from foreground and background sequences, enhancing predictive accuracy. However, SVMs require large datasets and well-defined background sequences, which are not always available. In this study, we systematically benchmarked PWM and SVM-based models using human ChIP-seq datasets from ENCODE. We evaluated the impact of key factors, including training dataset size, sequence length, background type, and kernel selection (for SVMs). SVM adoption remains limited due to the lack of readily available pretrained models. To bridge this gap, we introduce a curated collection of SVM-trained motif models, providing a valuable resource for TFBS prediction and regulatory genomics research. To systematically assess the robustness and efficacy of PWM and SVM models in TFBS prediction, we designed a benchmarking framework integrating theoretical and biologically relevant evaluations. We analyzed model adaptability under diverse conditions, considering biological data complexity and noise. For benchmarking, we used ChIP-seq peaks from ENCODE for 59 TFs. To evaluate the impact of background sequences in training and evaluation, we employed synthetic and biological (DNase-seq) background sequences. PWMs were trained using MEME/STREME, while SVM models were generated with LS-GKM.
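To make the contrast between the two model families concrete, the following is a minimal Python sketch, not the pipeline used in the study. It builds a log-odds PWM from a handful of toy aligned sites and scores a sequence by summing per-position scores, which is exactly where the positional-independence assumption enters. It also counts ungapped k-mers as a simplified stand-in for the k-mer features an SVM would use (LS-GKM itself works with gapped k-mers). The toy sites, the uniform background of 0.25, the pseudocount, and all function names are illustrative assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites.

    Toy construction for illustration; it does not reproduce MEME/STREME.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in ALPHABET
        })
    return pwm


def score_window(pwm, window):
    """Score one window; positions contribute independently, so scores add."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, offset) of the best forward-strand window."""
    width = len(pwm)
    return max(
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


def kmer_counts(sequence, k):
    """Count overlapping ungapped k-mers (simplified SVM-style features)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical aligned binding sites, used only to exercise the functions.
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA"]
pwm = build_pwm(toy_sites)
score, offset = best_hit(pwm, "GGGTGACTCAGGG")
features = kmer_counts("GGGTGACTCAGGG", 3)
```

Because the PWM score is a sum over columns, it cannot reward or penalize specific combinations of bases at different positions; k-mer features, by contrast, record which bases co-occur within a window.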
Each model was tested under four conditions: (i) training and testing on synthetic negatives, to establish a baseline against random noise; (ii) training and testing on DNase-seq backgrounds, to simulate biologically relevant conditions; (iii) training on synthetic negatives and testing on DNase-seq backgrounds, to evaluate adaptability to biological complexity; and (iv) training on DNase-seq backgrounds and testing on synthetic data, to assess whether models retain generalizability. For tests involving biological backgrounds, we considered two scenarios: a balanced 1:1 positive-to-negative ratio and an imbalanced 1:10 ratio, reflecting real-world genomic conditions where unbound regions outnumber bound sites. These evaluations provided a comprehensive assessment of model generalizability and robustness in TFBS prediction. Our analysis showed that SVMs generally outperform PWMs in predicting TFBSs, particularly on imbalanced datasets. PWMs perform best when trained on narrow, high-quality sequences, while SVMs leverage larger datasets but struggle with limited training data. Model performance is heavily influenced by background data: biologically relevant backgrounds, such as DNase-seq, significantly improve accuracy, even for PWMs. Analyzing the models' predictions, we found that SVMs localize TFBSs precisely, mapping them to the center of ChIP-seq peaks. Overall, PWMs are better suited to small, high-quality datasets lacking biological background information, such as SELEX, while SVMs excel with large datasets for which background data are available, such as ChIP-seq.
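The evaluation design can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the benchmarking code used in the study: the background labels, the function names, and the use of average precision as the metric are all hypothetical choices (the abstract does not name its metrics). The sketch enumerates the four train/test background pairings, assembles a test set at a chosen negatives-per-positive ratio, and computes average precision from ranked scores.

```python
import random
from itertools import product

# Hypothetical labels for the two background types named in the abstract.
BACKGROUNDS = ("synthetic", "dnase")


def evaluation_conditions():
    """Enumerate the four train/test background pairings.

    The pairings match conditions (i)-(iv), though not in the same order.
    """
    return [
        {"train_bg": train_bg, "test_bg": test_bg}
        for train_bg, test_bg in product(BACKGROUNDS, repeat=2)
    ]


def build_test_set(positives, negative_pool, neg_per_pos, seed=0):
    """Pair positives with sampled negatives at a 1:neg_per_pos ratio."""
    rng = random.Random(seed)
    negatives = rng.sample(negative_pool, len(positives) * neg_per_pos)
    return [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each true positive.

    The choice of metric is an assumption made for this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


conditions = evaluation_conditions()

# Placeholder sequence identifiers; real inputs would be peak and background sequences.
toy_positives = ["pos_a", "pos_b"]
toy_negative_pool = [f"neg_{i}" for i in range(30)]
balanced = build_test_set(toy_positives, toy_negative_pool, neg_per_pos=1)
imbalanced = build_test_set(toy_positives, toy_negative_pool, neg_per_pos=10)
```

A ranking-based metric such as average precision is sensitive to class imbalance, which makes it one reasonable way to compare the 1:1 and 1:10 scenarios; other metrics could be substituted without changing the structure of the sketch.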