Article Type: Research Paper
Date of acceptance: March 2024
Date of publication: June 2024
DOI: 10.5772/dmht.26
Copyright: © 2024 The Author(s). Licensee: IntechOpen. License: CC BY 4.0
It is becoming clear that bulk gene expression measurements represent an average over very different cells. Elucidating the expression and abundance of each of the encompassed cells is key to disease understanding and precision medicine approaches. A first step in any such deconvolution is the inference of cell type abundances in the given mixture. Numerous approaches to cell-type deconvolution have been proposed, yet very few take advantage of the emerging discipline of deep learning and most approaches are limited to input data regarding the expression profiles of the cell types in question. Here we present DECODE, a deep learning method for the task that is data-driven and does not depend on input expression profiles. DECODE builds on a deep unfolded non-negative matrix factorization technique. It is shown to outperform previous approaches on a range of synthetic and real data sets, producing abundance estimates that are closer to and better correlated with the real values.
deconvolution
bulk gene expression
non-negative matrix factorization
deep learning
Biological tissues are composed of a variety of distinct cell types. Identifying the composition of cells in tissues can help generate hypotheses regarding cell-type-specific biological mechanisms with important biomedical applications. For example, patients with a large number of infiltrating T cells are more likely to respond positively to immunotherapy [1]. Thus, there is a need for deconvolving a tissue of interest to its constituent cells.
Flow cytometry is the main standard for experimental deconvolution of a sample. More recently, single-cell RNA sequencing (scRNA-seq) methods have become available. However, these methods have their limitations: flow cytometry requires prompt and careful processing of samples as well as tissue disaggregation, which may result in the loss of fragile cell types and the distortion of gene expression profiles. ScRNA-seq methods are expensive for large sample studies. Additionally, in these technologies, cell types such as neurons, myocytes, and adipocytes are difficult to capture due to their size and morphology.
Thus, several computational methods have been suggested for predicting cell fractions from bulk expression data. Most methods rely on a signature matrix of cell-type-specific expression profiles to predict cell type abundances. Recent comparative analyses of deconvolution methods [2-10] have highlighted state-of-the-art methods for this task, including non-negative least squares (NNLS) [11], CIBERSORT [12] and CIBERSORTx [10], which are based on support vector regression, GEDIT [13], which runs linear regression, and SCADEN [9], which employs a deep learning approach. However, most of these methods rely heavily on the input signature matrices, which are global matrices that contain no information specific to the input tissue. Furthermore, most of these methods employ classical regression approaches and do not exploit the rich expressive power of deep models, which are expected to have a considerable advantage as more training data become available.
We propose DECODE (DEep Cell-type DEconvolution), a novel deep-learning algorithm that predicts the cell type abundance matrix from bulk gene expression data and a signature matrix. The algorithm is based on a deep unfolding algorithm for non-negative matrix factorization (NMF) and combines both supervised learning on synthetic data and unsupervised learning to achieve its task. We benchmark DECODE using both simulated and real datasets and show that it outperforms previous approaches. DECODE introduces several key novelties that explain its good performance: (i) signature matrices are not explicitly represented by the model but only used to initialize the model and to generate training data, thus allowing data-driven behavior; (ii) NMF techniques for simultaneous prediction of cell fractions and signatures cannot be directly used for this problem since they do not guarantee that cell fraction vectors will sum to one, while DECODE can be adjusted to this constraint as it is based on a flexible neural network architecture; (iii) the generation of synthetic data, and subsequent training on it, overcomes the small amount of available training data; and (iv) the combination of supervised and unsupervised training helps the model tune to these two different goals, both of which matter on real data, where training is unsupervised but evaluation is against the true (hidden) cell fractions. DECODE is made available at https://github.com/eranhermush/DECODE.
In gene expression deconvolution, the input is a matrix of bulk gene expression across multiple samples and a signature matrix consisting of expression profiles of specific cell types. The goal is to infer a matrix of cell fractions indicating for each sample its cell-type decomposition. We approach this problem using a deep learning algorithm for NMF that aims to factor the input gene expression matrix into the product of the signature matrix and the cell fraction matrix. Due to the scarcity of training data, we train the algorithm parameters using a combination of synthetically generated data and real data. A high-level pseudo-code of our algorithm appears in Figure 1 and is described in the following subsections.
We assume a bulk expression matrix $V \in \mathbb{R}_{\ge 0}^{g \times n}$ over $g$ genes and $n$ samples, and a signature matrix $W \in \mathbb{R}_{\ge 0}^{g \times c}$ whose columns are the expression profiles of $c$ cell types. The goal is to infer a cell fraction matrix $H \in \mathbb{R}_{\ge 0}^{c \times n}$ such that $V \approx WH$ and each column of $H$ sums to one.
Our algorithm, DECODE, is based on a deep unfolding approach for NMF, DNMF [14]. DNMF is a deep learning algorithm for the decomposition of a non-negative matrix into a product of two non-negative matrices; it unfolds the iterative multiplicative updates of classical NMF into the layers of a neural network whose weights are learned from data.
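For intuition, the classical multiplicative update that such unfolding schemes mimic is shown below; this is the standard Lee–Seung formulation written in the notation above, not necessarily DNMF's exact parameterization.

$$
H \leftarrow H \odot \frac{W^{\top} V}{W^{\top} W H},
$$

where $\odot$ and the fraction denote element-wise multiplication and division. Deep unfolding turns a fixed number of such iterations into network layers, replacing the matrices $W^{\top}$ and $W^{\top}W$ with learned, layer-specific weights.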
DNMF has two model variants: supervised and unsupervised. The supervised variant assumes a known fraction matrix $H$ for its training samples and minimizes the error between the predicted and known fractions, while the unsupervised variant minimizes the reconstruction error $\|V - WH\|$ of the input matrix.
We introduce several novelties into the DNMF architecture and training process. First, to account for the fact that each column of the cell fraction matrix must sum to one, we add a normalization step to the network so that every predicted fraction vector is non-negative and sums to one (a minimal sketch of such a step appears below).
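As an illustration, the following NumPy sketch shows one unfolded update step followed by the column normalization that enforces the sum-to-one constraint. The layer weights `A` and `B` are hypothetical learned parameters, initialized here from the signature matrix and kept fixed for brevity; this is a minimal sketch under our own simplifying assumptions, not the exact DECODE architecture.

```python
import numpy as np

def unfolded_layer(H, V, A, B, eps=1e-10):
    """One unfolded NMF-style update with learned, layer-specific weights
    A and B (hypothetical) playing the roles of W^T and W^T W."""
    return H * (A @ V) / (B @ H + eps)

def normalize_columns(H, eps=1e-10):
    """Enforce the cell-fraction constraint: non-negative entries,
    each column (sample) summing to one."""
    H = np.clip(H, 0.0, None)
    return H / (H.sum(axis=0, keepdims=True) + eps)

# toy dimensions: g genes, c cell types, n samples
g, c, n = 50, 5, 10
rng = np.random.default_rng(0)
V = rng.random((g, n))                       # bulk expression matrix
W = rng.random((g, c))                       # signature matrix (used only for initialization)
H = normalize_columns(rng.random((c, n)))    # initial fraction estimates

A, B = W.T, W.T @ W                          # initialization mimicking the classical update
for _ in range(4):                           # four unfolded layers, as used in the paper
    H = normalize_columns(unfolded_layer(H, V, A, B))
```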
Last, we use a combined supervised and unsupervised training process, where the former is based on synthetic data while the latter is based on both synthetic and real data. We first train the model on synthetic data in a supervised fashion for one epoch, and then continue with unsupervised training as described below.
To deal with data scarcity, we first trained the algorithm on synthetic data. Specifically, we collected known ranges of cell frequencies from [16] and used values within these ranges as parameters for a Dirichlet distribution. We then drew random fraction vectors from this distribution. The resulting fraction matrix was multiplied by the given signature matrix to produce a synthetic bulk expression matrix. We drew multiple matrices in this fashion and fed them sequentially to the training process, viewing each such matrix as one batch. We further added small normally-distributed noise, with zero mean and a small standard deviation, to each bulk expression matrix.
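A minimal sketch of this generation step, assuming a signature matrix `W` (genes x cell types) and Dirichlet concentration parameters `alpha` derived from the published frequency ranges; the noise scale and batch size below are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

def synthetic_batch(W, alpha, n_samples, noise_std=0.01, rng=None):
    """Draw cell-fraction vectors from a Dirichlet prior, mix them through the
    signature matrix, and add small zero-mean Gaussian noise to the bulk profiles."""
    rng = np.random.default_rng() if rng is None else rng
    H = rng.dirichlet(alpha, size=n_samples).T        # (cell types x samples), columns sum to one
    V = W @ H                                         # synthetic bulk expression (genes x samples)
    V = np.clip(V + rng.normal(0.0, noise_std, size=V.shape), 0.0, None)
    return V, H

# usage: each drawn matrix serves as one training batch
rng = np.random.default_rng(1)
W = np.abs(rng.normal(size=(200, 7)))                 # placeholder signature matrix
alpha = np.ones(7)                                    # placeholder Dirichlet concentrations
V_syn, H_syn = synthetic_batch(W, alpha, n_samples=32, rng=rng)
```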
In the unsupervised training, we trained the DECODE model, starting from the parameters learned in the supervised phase, to minimize the reconstruction error between the input bulk expression matrix and its predicted factorization.
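Putting the two phases together, a compact PyTorch sketch of the training schedule follows. `ToyDeconvNet` is a deliberately simple stand-in for DECODE's unfolded network, and the two loss functions reflect the supervised (match known fractions) versus unsupervised (reconstruct the bulk matrix) objectives described above; all names are hypothetical.

```python
import torch
import torch.nn as nn

class ToyDeconvNet(nn.Module):
    """Stand-in for DECODE's unfolded network: maps bulk profiles to
    cell-fraction vectors that are non-negative and sum to one per sample."""
    def __init__(self, n_genes, n_cells):
        super().__init__()
        self.lin = nn.Linear(n_genes, n_cells)

    def forward(self, V):                                # V: (genes x samples)
        return torch.softmax(self.lin(V.T), dim=1).T     # (cells x samples)

def training_phase(net, opt, batches, W, supervised):
    """One pass over the data: supervised loss on the fractions for synthetic
    batches, reconstruction loss on the bulk matrix otherwise."""
    for V, H_true in batches:
        H_pred = net(V)
        if supervised:
            loss = torch.mean((H_pred - H_true) ** 2)    # match known (synthetic) fractions
        else:
            loss = torch.mean((V - W @ H_pred) ** 2)     # reconstruct the observed bulk matrix
        opt.zero_grad()
        loss.backward()
        opt.step()

# sketch of the schedule described in the text:
# net = ToyDeconvNet(n_genes, n_cells)
# opt = torch.optim.Adam(net.parameters(), lr=1e-3)
# training_phase(net, opt, synthetic_batches, W, supervised=True)   # phase 1: synthetic, supervised
# training_phase(net, opt, all_batches,       W, supervised=False)  # phase 2: synthetic + real, unsupervised
```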
DECODE has several hyperparameters that need tuning. Due to the small size and number of available data sets, we opted to use an independent data set for tuning. For the number of layers we used 4 for speed considerations, as the model's accuracy is robust to the specific number used, and we used a learning rate of 0.001 [14]. Since the problem we are trying to tackle is unsupervised in nature, we followed [14] and did not use regularization. The number of training iterations was likewise set according to performance on the independent tuning data.
Good training data for the deconvolution problem is scarce. Our main data source is a recent benchmark paper [2], which has three available datasets, all consisting of peripheral blood mononuclear cell (PBMC) mixtures with known ground-truth cell fractions.
In addition, we retrieved a real dataset, GSE65133, from [12]. This dataset contains 20 PBMC samples whose true cell fractions were measured by flow cytometry.
To complement the expression datasets, we used known signature matrices from [2]. That study contains a comparative analysis of 9 deconvolution methods with respect to 10 signature matrices. Among the top-performing methods were CIBERSORT, NNLS and GEDIT, which we use for comparison. We averaged the accuracy values reported in Figure 1 of [2] for each of the signature matrices and selected the four signatures with the highest accuracy values, which are the focus of our study: LM22, Skin Signatures, HPCA-Blood, and BlueCode. In detail, LM22 [12] contains 22 cell types and 547 genes; Human Primary Cell Atlas (HPCA-Blood) [17] contains 7 cell types and 19,715 genes; BlueCode [18] contains 34 cell types and 13,299 genes; and Skin Signatures [19] contains 21 cell types and 20,307 genes.
We preprocess the data using GEDIT's approach [13], which removes cell types that are not present in either the input expression matrix or the signature matrix, runs quantile normalization on both matrices so that each column follows the same distribution, removes genes that are missing from either matrix, and, for each cell type, selects a subset of 50 genes with the lowest entropy to focus on. For each cell type we want genes that are expressed in a cell-type-specific manner, and entropy is minimized when expression is detected in only a single cell type.
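To make the two less standard steps concrete, a NumPy sketch of quantile normalization and entropy-based gene selection follows. It illustrates the idea under simplifying assumptions (no tie handling, natural-log entropy, genes assigned to the cell type in which they are most expressed) and is not GEDIT's exact implementation.

```python
import numpy as np

def quantile_normalize(X):
    """Make every column of X follow the same distribution (the mean of the
    per-column sorted values), as in standard quantile normalization."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)    # per-column ranks
    reference = np.sort(X, axis=0).mean(axis=1)          # shared reference distribution
    return reference[ranks]

def lowest_entropy_genes(signature, n_per_cell=50, eps=1e-12):
    """For each cell type, keep the genes whose expression across cell types
    has the lowest entropy, i.e. is most cell-type specific."""
    P = signature / (signature.sum(axis=1, keepdims=True) + eps)  # per-gene profile over cell types
    entropy = -(P * np.log(P + eps)).sum(axis=1)                  # one entropy value per gene
    keep = set()
    for c in range(signature.shape[1]):
        candidates = np.where(signature.argmax(axis=1) == c)[0]   # genes peaking in cell type c
        keep.update(candidates[np.argsort(entropy[candidates])][:n_per_cell].tolist())
    return np.array(sorted(keep))
```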
We used several performance measures to compare DECODE to four existing cell deconvolution algorithms: CIBERSORTx, NNLS, GEDIT and SCADEN. We ran GEDIT with its R source code, CIBERSORTx from its official website (https://cibersortx.stanford.edu/), NNLS with its R function, and SCADEN with its Python source code, keeping its default training datasets (as it does not train with a signature matrix). To compare the performance of the five deconvolution algorithms, we measured both RMSE (root mean squared error) and the Pearson correlation coefficient, comparing predicted cell fraction estimates with the real fractions.
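The two metrics are computed on the predicted versus true cell-fraction matrices; a straightforward sketch, pooling all entries (one common convention), is shown below.

```python
import numpy as np

def rmse(H_true, H_pred):
    """Root mean squared error over all cell-fraction entries."""
    d = np.asarray(H_true) - np.asarray(H_pred)
    return float(np.sqrt(np.mean(d ** 2)))

def pearson(H_true, H_pred):
    """Pearson correlation between flattened true and predicted fractions."""
    x, y = np.asarray(H_true).ravel(), np.asarray(H_pred).ravel()
    return float(np.corrcoef(x, y)[0, 1])
```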
We designed a novel algorithm for cell-type deconvolution, DECODE, which is based on the DNMF method [14] and on a novel learning pipeline in which supervised and unsupervised versions of the method are first applied to synthetic data to enhance the learning process. A high-level description of the algorithm is shown in Figure 1. A detailed description of the algorithm and its hyperparameter tuning is given in Methods.
To evaluate DECODE, we applied it to three independent test datasets and compared its performance to that of four state-of-the-art approaches: NNLS [11], CIBERSORTx (an updated version of CIBERSORT) [10], GEDIT [13] and SCADEN [9]. As a first test, we applied DECODE to two simulated datasets of PBMC cells from [2]. The results are summarized in Figure 3 and show the superiority of our approach over previous methods with respect to the two most common evaluation metrics, RMSE and Pearson correlation.
As a second test, we applied DECODE to a real dataset of PBMCs from [12], again obtaining favorable results (Figures 4 and 5).
In summary, DECODE significantly improved on the previous methods. Table 1 shows that DECODE achieves markedly lower RMSE than the previous best method on each dataset.
Table 1. Comparison of DECODE with previous best methods.

Dataset | DECODE RMSE | Previous best RMSE (method) | DECODE improvement (%)
---|---|---|---
PBMC1 | 0.0678 | 0.0737 (SCADEN) | 8
PBMC2 | 0.0712 | 0.0936 (NNLS) | 23
Real GSE65133 | 0.1137 | 0.1411 (SCADEN) | 20
We provided a deep learning framework for deconvolution of bulk gene expression into its cell fractions. Its main innovations include the generation of labeled training data and the combination of supervised and unsupervised learning in the training process, as well as the use of the DNMF method, which does not explicitly encode the cell signatures, allowing data-driven behavior. We demonstrated the utility of our framework in deconvolution of simulated and real data.
While DECODE’s methodology does not depend on a signature matrix, such a matrix is used in the initialization of the neural network. Future work includes the inference of the signature matrix as part of the learning process so as not to depend on receiving it as input. Another limitation of DECODE is the use of synthetic data for training due to the scarcity of real data. With the accumulation of single cell expression data, a potential way forward is to use these data to simulate deconvolution scenarios and thus improve the training process.
RS was supported by a joint program grant from the Cancer Biology Research Center (CBRC), Djerassi Oncology Center, Edmond J. Safra Center for Bioinformatics and Tel Aviv University Center for AI and Data Science (TAD).
The authors declare no conflict of interest.
© The Author(s) 2024. Licensee IntechOpen. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.