CELLiD
The accuracy of an automated cell type annotator is highly contingent on the quality of the underlying reference dataset. Therefore, we leverage our own Atlas which we have manually annotated, using the as a guide for naming conventions. We also constructed a to illustrate the relationship between different cell types based on their gene expression profiles.
With the DISCO Atlas as our reference dataset, we developed CELLiD. CELLiD takes the average expression level of every gene in a cell cluster as input and carries out two rounds of prediction. In the first round (coarse grain stage), Spearman correlations are computed between each input cell cluster and every cell type in the reference dataset, using all overlapping genes. A higher correlation implies greater similarity in gene expression. The 20 top-ranked cell types are retained for the next stage. We chose Spearman’s Rho over Pearson’s correlation coefficient because it captures monotonic relationships between expression values even when they are non-linear.
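The coarse grain stage can be illustrated with a minimal Python sketch. This is not CELLiD's actual implementation: the function names (`ranks`, `spearman`, `coarse_stage`), the dictionary-based data layout, and the toy expression values are all assumptions made for illustration. It only uses the standard library and follows the description above: correlate over overlapping genes, then keep the top-ranked cell types.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def coarse_stage(cluster, reference, keep=20):
    """Score every reference cell type on the overlapping genes; keep the best."""
    scores = {}
    for cell_type, profile in reference.items():
        genes = sorted(set(cluster) & set(profile))  # overlapping genes only
        if len(genes) < 2:
            continue  # a correlation needs at least two genes
        scores[cell_type] = spearman(
            [cluster[g] for g in genes], [profile[g] for g in genes]
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:keep]


# Toy data (made up): average expression per gene.
cluster = {"CD3E": 5.0, "CD19": 0.1, "LYZ": 0.3, "NKG7": 1.0}
reference = {
    "T cell": {"CD3E": 4.0, "CD19": 0.0, "LYZ": 0.2, "NKG7": 0.9},
    "B cell": {"CD3E": 0.1, "CD19": 5.0, "LYZ": 0.3, "NKG7": 0.0},
    "Monocyte": {"CD3E": 0.0, "CD19": 0.1, "LYZ": 6.0, "MS4A1": 0.0},
}
top = coarse_stage(cluster, reference, keep=2)
print(top)
```

In this toy example the T cell profile ranks its genes in the same order as the input cluster, so it receives the highest score and is carried forward.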
In the second round (fine grain stage), the top 3000 highly variable genes (HVGs) in the retained cell types are used to re-calculate the Spearman correlations. The two cell types with the highest correlation coefficients are reported.
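A rough sketch of the fine grain stage is given below. The text does not state how the HVGs are chosen, so this example assumes they are the genes with the largest variance across the retained cell type profiles; that criterion, the function names, and the toy values are hypothetical and may differ from what CELLiD does internally.

```python
from statistics import pvariance


def _ranks(values):
    """1-based ranks with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out, i = [0.0] * len(values), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def fine_stage(cluster, retained, n_hvg=3000, report=2):
    """Re-score the retained cell types on HVGs and report the best matches."""
    # Genes present in the input cluster and in every retained profile.
    shared = set(cluster)
    for profile in retained.values():
        shared &= set(profile)
    # ASSUMPTION: "highly variable" = largest variance across retained types.
    hvgs = sorted(
        sorted(shared),
        key=lambda g: pvariance([p[g] for p in retained.values()]),
        reverse=True,
    )[:n_hvg]
    x = [cluster[g] for g in hvgs]
    scores = {ct: spearman(x, [p[g] for g in hvgs]) for ct, p in retained.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:report]


# Toy data (made up); n_hvg is reduced to 3 to fit the tiny gene set.
retained = {
    "CD4 T": {"CD3E": 5.0, "CD4": 4.0, "CD8A": 0.0, "NKG7": 0.5, "ACTB": 8.0},
    "CD8 T": {"CD3E": 5.0, "CD4": 0.0, "CD8A": 4.0, "NKG7": 2.0, "ACTB": 8.0},
    "NK": {"CD3E": 0.5, "CD4": 0.0, "CD8A": 0.5, "NKG7": 5.0, "ACTB": 8.0},
}
cluster = {"CD3E": 4.5, "CD4": 0.1, "CD8A": 3.5, "NKG7": 1.5, "ACTB": 7.0}
best = fine_stage(cluster, retained, n_hvg=3, report=2)
print(best)
```

Restricting the comparison to variable genes drops uninformative ones such as `ACTB` in this toy data, whose expression is identical across all retained cell types and therefore cannot help tell them apart.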
Splitting the prediction into two rounds of increasing granularity improves the accuracy of CELLiD. As the underlying Atlas is expanded and updated, CELLiD will also take the new data into account, further increasing its accuracy.
CELLiD requires 2 inputs from the user:
A table containing the average gene expression values for each cluster provided in this format:
Selection of a reference dataset
Please note:
Columns can be separated by tabs (\t), commas (,) or spaces ( ).
Header rows are not required.
Please use all genes for a better prediction.
We recommend using normalized gene expression values as input.
If the table is generated using Excel, all Excel cells need to be formatted as Text.
The input table can be generated from Seurat objects using the script below.
A filter is provided in the input data section to select the reference dataset.
This will restrict CELLiD to the selected reference datasets when calculating the correlation between the user’s input clusters and reference cell types.
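To make the input rules above concrete, here is a hedged Python sketch of a parser that accepts tab-, comma- or space-separated tables with or without a header row. It is not CELLiD's own parser. It assumes the first column holds gene names and each remaining column holds one cluster's average expression; the function names and the `cluster_N` fallback labels are hypothetical.

```python
import re


def is_number(text):
    """True if the field can be read as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def parse_table(text):
    """Parse an expression table into {cluster_name: {gene: value}}."""
    # Split on tabs, commas or spaces; runs of delimiters count as one.
    rows = [
        re.split(r"[\t, ]+", line.strip())
        for line in text.splitlines()
        if line.strip()
    ]
    # ASSUMPTION: a first row with non-numeric value fields is a header.
    header = None
    if rows and not all(is_number(v) for v in rows[0][1:]):
        header = rows.pop(0)
    if not rows or len(rows[0]) < 2:
        raise ValueError("expected a gene column and at least one cluster column")
    n_clusters = len(rows[0]) - 1
    if header:
        names = header[-n_clusters:]  # tolerate a missing gene-column label
    else:
        names = [f"cluster_{i + 1}" for i in range(n_clusters)]
    clusters = {name: {} for name in names}
    for gene, *values in rows:
        for name, value in zip(names, values):
            clusters[name][gene] = float(value)
    return clusters


# Mixed delimiters and no header row (toy values).
parsed = parse_table("CD3E\t5.0\t0.1\nCD19,0.2,4.0\nLYZ 0.3  0.4\n")
print(parsed["cluster_1"])
```

Because runs of delimiters are collapsed, this sketch cannot represent empty fields; a table with missing values would need stricter handling.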
The user provides gene expression data and selects the reference dataset of interest. CELLiD begins by selecting the first cluster for prediction.
For each cell type present in the atlas, CELLiD takes the genes that overlap with the input cluster and computes a Spearman correlation between the two sets of expression values. A dummy dataset is used for explanation purposes here.
For the retained cell types, a second round of prediction is done using the HVGs. For simplicity, only 25 of the 3000 HVGs are shown here.
This workflow is repeated for all clusters and the results are generated.
The results are organized as follows:
Input ID – cluster number
Primary/Secondary Prediction – the two most likely cell types for the cluster, as identified by the pipeline above.
Primary/Secondary Tissue – the likely tissue origin of the cluster, based on the specific tissue atlas that was used by CELLiD to identify the cluster. To restrict the atlases used by CELLiD to determine cluster identity, the reference dataset can be adjusted in the inputs.
Primary/Secondary Score – the Spearman’s Rho value. A value closer to 1 suggests that the input cluster is more similar to the predicted cell type in the Atlas.
You can add the predicted cell types to your Seurat object using the code below.