Empowering bulk RNA-seq deconvolution algorithms by integrating multiple transcriptomics datasets

Martina GALLINARO; Giovanni MALERBA
2024-01-01

Abstract

Over the past two decades, many methods have emerged to infer the proportions of individual cell types from bulk transcriptomics data, including newer methods that use single-cell RNA-sequencing (scRNA-seq) data as a reference for bulk RNA-sequenced samples. Developing these methods raises several challenges: first, reference datasets must be built with standardised, state-of-the-art computational tools; second, cell type annotation and marker selection must be standardised; and finally, both the algorithms and the signature atlases must generalise better to new bulk sample conditions. The first step towards addressing some of these challenges is to devise a single-cell reference panel that provides a standardised resource, with a consistent annotation method, to serve as ground truth for the deconvolution algorithm. The final goal is to obtain deconvoluted data at the cell type level from bulk gene expression. In this preliminary test, we included two liver datasets: GSE149614 (71,915 cells; three non-viral tumour samples and seven HBV- or HCV-related tumour samples) and a subset of GSE243981 (24,242 cells; six healthy samples), to create a balanced resource without a focus on a specific function or disease. To maximise the standardisation of our workflow, we performed marker-based annotation of the integrated panel using the ScType software and a curated list of signatures from GSE149614, GSE243981 and the CellMarker 2.0 database. Deconvolution was performed with the β-VAE implementation provided by the Bulk2Space software, which generates single-cell-like expression data. To assess the usability of the resource, we used several types of bulk RNA-seq data: two normal liver samples (one from GTEx, one from an internal dataset), one liver tumour sample (internal dataset) and one primary human hepatocyte sample from liver resection.
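The idea behind marker-based annotation can be illustrated with a deliberately simplified sketch. It is not the ScType implementation and does not reproduce its weighting scheme: it z-scores each gene across cells, sums the scores of each cell type's marker genes, and labels every cluster with the best-scoring type. The function names, the toy expression matrix and the two-entry marker list are assumptions made for illustration only.

```python
import numpy as np

def score_cell_types(expr, genes, markers):
    """Return {cell_type: per-cell score} from a genes x cells matrix.

    Simplified illustration only; not the scoring used by ScType.
    """
    # z-score each gene across cells so highly expressed genes do not dominate
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0  # avoid division by zero for constant genes
    z = (expr - mu) / sd

    index = {g: i for i, g in enumerate(genes)}
    scores = {}
    for cell_type, marker_genes in markers.items():
        rows = [index[g] for g in marker_genes if g in index]
        if rows:
            # normalise by sqrt(n) so long marker lists are not favoured
            scores[cell_type] = z[rows].sum(axis=0) / np.sqrt(len(rows))
    return scores

def annotate_clusters(scores, clusters):
    """Assign to each cluster the cell type with the highest summed score."""
    types = list(scores)
    labels = {}
    for c in np.unique(clusters):
        mask = clusters == c
        totals = [scores[t][mask].sum() for t in types]
        labels[int(c)] = types[int(np.argmax(totals))]
    return labels

# Toy data (hypothetical): 4 genes x 6 cells, two clusters of three cells
genes = ["ALB", "APOA1", "CD3E", "CD3D"]
expr = np.array([
    [9, 8, 9, 0, 1, 0],
    [7, 8, 7, 1, 0, 0],
    [0, 0, 1, 6, 7, 6],
    [0, 1, 0, 5, 6, 5],
], dtype=float)
clusters = np.array([0, 0, 0, 1, 1, 1])
markers = {"Hepatocytes": ["ALB", "APOA1"], "T cells": ["CD3E", "CD3D"]}

labels = annotate_clusters(score_cell_types(expr, genes, markers), clusters)
print(labels)
```

Scoring at the cluster level rather than per cell makes the assignment less sensitive to dropout in individual cells, which is why the sketch aggregates scores before choosing a label.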
The two datasets were integrated using Seurat/Harmony, yielding a panel of 96,159 cells and 16 samples. Annotation followed a two-step approach: initial annotation by main cell type, followed by subtype identification within each cell type. To evaluate the results of the integration procedure, we compared the cell types and cell subtypes identified by our annotation approach with the original annotations provided with each dataset. Concordance on cell types was 93% for GSE149614 and 79% for GSE243981; concordance on cell subtypes was 46% and 86%, respectively. Moreover, the deconvolution algorithm transferred the correct cell type label (Hepatocytes) to the deconvoluted RNA-seq data. With this first attempt, we demonstrate the feasibility of leveraging the atlas-level characteristics of a single-cell reference to perform bulk RNA-seq deconvolution while retaining the relevant cell type information. Furthermore, we established a standardised workflow for dataset integration and annotation. Future work will broaden the variability represented in the resource by including more datasets. In addition, replacing the current β-VAE with an Adversarial Autoencoder network could improve deconvolution quality and increase the resolution of the label transfer (from cell types to cell subtypes).
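The abstract does not state how concordance was computed. One plausible reading, sketched below as an assumption, is the fraction of cells whose new label matches the original annotation, reported overall and per original label. The function names and the toy labels are hypothetical, and the sketch assumes that label names have already been harmonised between the two annotations.

```python
from collections import Counter

def concordance(predicted, reference):
    """Fraction of cells whose predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    if not predicted:
        raise ValueError("label lists must not be empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(predicted)

def per_label_concordance(predicted, reference):
    """Concordance computed separately for each reference label."""
    totals = Counter(reference)
    hits = Counter(r for p, r in zip(predicted, reference) if p == r)
    return {label: hits[label] / n for label, n in totals.items()}

# Toy labels (hypothetical), one entry per cell
reference = ["Hepatocytes", "Hepatocytes", "T cells", "T cells", "Kupffer cells"]
predicted = ["Hepatocytes", "Hepatocytes", "T cells", "NK cells", "Kupffer cells"]

overall = concordance(predicted, reference)
by_label = per_label_concordance(predicted, reference)
print(f"overall: {overall:.0%}")
print(by_label)
```

The per-label breakdown helps locate where disagreement concentrates, which is useful when overall concordance differs between cell types and subtypes.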
2024
Keywords: single-cell, scRNA-seq, dataset integration, bulk RNA-seq, deconvolution
File in this record: poster155.JOBIM.2024_Gallinaro_et_al.pdf (Adobe PDF, 1.97 MB; open access; licence: public domain)

Use this identifier to cite or link to this document: https://hdl.handle.net/11562/1130546