CELLiD

CELLiD is our in-house cell type predictor. It annotates single-cell data quickly and accurately, using either one of our provided datasets or your own customized atlas as the reference. CELLiD was also used to annotate the DISCO datasets found in the repository.

How it works

The accuracy of an automated cell type annotator depends heavily on the quality of the underlying reference dataset. We therefore use our own Atlas, which we annotated manually with the Cell Ontology as a guide for naming conventions. We also constructed a hierarchical tree to illustrate the relationships between cell types based on their gene expression profiles.

With the DISCO Atlas as our reference dataset, we developed CELLiD. CELLiD takes the average expression level of every gene in a cell cluster as its input and carries out two rounds of prediction. In the first round (coarse grain stage), a Spearman correlation is computed between each input cell cluster and every cell type in the reference dataset, using all genes the two have in common. A higher correlation value implies greater similarity in gene expression. The 20 top-ranked cell types are retained for the next stage. We chose Spearman’s Rho over Pearson’s correlation coefficient because it is rank-based and can therefore accommodate monotonic but non-linear relationships in the expression values.
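The coarse grain round described above can be sketched in base R. This is a minimal illustration under stated assumptions, not CELLiD's actual implementation: `coarse_round`, `query`, and `reference` are hypothetical names, the reference is assumed to be a genes x cell types matrix of average expression, and the toy values are random.

```r
# Hypothetical sketch of the coarse grain round (not the CELLiD source code).
# query:     named numeric vector, average expression of one input cluster
# reference: numeric matrix, genes in rows, reference cell types in columns
coarse_round <- function(query, reference, n_keep = 20) {
  # Restrict both inputs to the genes they share
  shared <- intersect(names(query), rownames(reference))
  # Spearman correlation between the cluster and every reference cell type
  rho <- apply(reference[shared, , drop = FALSE], 2, function(ref_col) {
    cor(query[shared], ref_col, method = "spearman")
  })
  # Keep the n_keep best-correlated cell types, highest first
  head(sort(rho, decreasing = TRUE), n_keep)
}

# Toy data: 50 genes, 30 made-up cell types
set.seed(1)
reference <- matrix(rexp(50 * 30), nrow = 50,
                    dimnames = list(paste0("gene", 1:50), paste0("type", 1:30)))
# The query resembles type7 plus a little noise, and also carries
# 5 genes that the reference lacks (these are dropped by intersect())
query <- c(reference[, "type7"] + rnorm(50, sd = 0.01),
           setNames(rexp(5), paste0("extra", 1:5)))
top20 <- coarse_round(query, reference)
names(top20)[1]
```

In this toy setup the query was built from `type7`, so that cell type ranks first among the 20 retained candidates.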

In the second round (fine grain stage), the Spearman correlations are recalculated using only the top 3000 highly variable genes (HVGs) in the retained cell types. The two cell types with the highest correlation coefficients are reported as the primary and secondary predictions.
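The fine grain round can be sketched in the same way. This is an illustration only: `fine_round` and its arguments are hypothetical names, and because this document does not say how CELLiD selects HVGs, the sketch substitutes a simple stand-in (ranking genes by their variance across the retained cell types).

```r
# Hypothetical sketch of the fine grain round (not the CELLiD source code).
# query:    named numeric vector, average expression of one input cluster
# retained: numeric matrix, genes x cell types kept after the coarse round
fine_round <- function(query, retained, n_hvg = 3000, n_report = 2) {
  shared <- intersect(names(query), rownames(retained))
  retained <- retained[shared, , drop = FALSE]
  # ASSUMPTION: rank genes by variance across the retained cell types;
  # the HVG method CELLiD actually uses is not described in this document
  gene_var <- apply(retained, 1, var)
  n_use <- min(n_hvg, length(shared))
  hvgs <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_use)]
  # Recompute the Spearman correlations on the HVGs only
  rho <- apply(retained[hvgs, , drop = FALSE], 2, function(ref_col) {
    cor(query[hvgs], ref_col, method = "spearman")
  })
  # Report the best-correlated cell types, highest first
  head(sort(rho, decreasing = TRUE), n_report)
}

# Toy data: 200 genes, 20 made-up retained cell types
set.seed(2)
retained <- matrix(rexp(200 * 20), nrow = 200,
                   dimnames = list(paste0("gene", 1:200), paste0("type", 1:20)))
query <- retained[, "type3"] + rnorm(200, sd = 0.01)
best_two <- fine_round(query, retained, n_hvg = 50)
names(best_two)
```

The first name returned plays the role of the primary prediction and the second that of the secondary prediction.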

Splitting the prediction into two rounds of increasing granularity improves the accuracy of CELLiD. As the underlying Atlas is expanded and updated, CELLiD will take the new data into account, which should further increase its accuracy.

Input

CELLiD requires two inputs from the user:

  1. A tab-delimited table containing the average gene expression values for each cluster, with one row per gene and one column per cluster (the Seurat code below generates this file)

  2. Selection of a reference dataset

Generate input for CELLiD with Seurat

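library(Seurat)

# Compute the average expression of every gene in each cluster
rna.data.average <- AverageExpression(rna.data)

# Keep the RNA assay, rounded to 2 decimal places
rna.data.average <- round(rna.data.average$RNA, 2)

# Write a tab-delimited file: gene names in the first column, no header row
write.table(rna.data.average, "CELLiD_input.txt", quote = FALSE, col.names = FALSE, row.names = TRUE, sep = "\t")
# Then, you can upload this file to our server for CELLiD prediction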
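Before uploading, it can help to check what the file looks like. The sketch below does not need Seurat: it writes a small made-up matrix with the same `write.table` arguments and reads it back. The gene names and values are invented for illustration, and the layout shown is inferred from the `write.table` call above rather than from a format specification.

```r
# Toy stand-in for rna.data.average: 3 genes x 2 clusters (made-up values)
avg <- matrix(c(0.12, 3.4, 0, 1.05, 0, 2.2), nrow = 3,
              dimnames = list(c("CD3E", "MS4A1", "LYZ"), c("0", "1")))
path <- tempfile(fileext = ".txt")
write.table(avg, path, quote = FALSE, col.names = FALSE, row.names = TRUE, sep = "\t")

# Read it back: no header line, gene names in column 1, one column per cluster
check <- read.table(path, sep = "\t", header = FALSE, row.names = 1)
dim(check)        # number of genes, number of clusters
rownames(check)   # gene names
# Simple sanity checks: no missing values, no duplicated gene names
stopifnot(!any(is.na(check)), !any(duplicated(rownames(check))))
```

The same `read.table` call can be pointed at your own `CELLiD_input.txt` to confirm that the number of columns matches the number of clusters.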

A filter is provided in the input data section to select the reference dataset.

This will restrict CELLiD to the selected reference datasets when calculating the correlation between the user’s input clusters and reference cell types.

Workflow

Obtain user input

The user provides gene expression data and selects the reference dataset of interest. CELLiD then processes the clusters one at a time, starting with the first.

Prediction Round 1: Coarse Grain Stage

For each cell type present in the atlas, CELLiD takes the genes it shares with the input cluster and computes the Spearman correlation between the two sets of expression values. A dummy dataset is used for explanation purposes here.

Prediction Round 2: Fine Grain Stage

For the retained cell types, a second round of prediction is done using the HVGs. For simplicity, only 25 of the 3000 HVGs are shown here.

Results

This workflow is repeated for all clusters and the results are generated.

The results are organized as follows:

  • Input ID – cluster number

  • Primary/Secondary Prediction – the two cell types that CELLiD ranks as the most likely identity of the cluster, based on the pipeline above.

  • Primary/Secondary Tissue – the likely tissue origin of the cluster, based on the specific tissue atlas that was used by CELLiD to identify the cluster. To restrict the atlases used by CELLiD to determine cluster identity, the reference dataset can be adjusted in the inputs.

  • Primary/Secondary Score – the Spearman’s Rho value for the corresponding prediction. A value closer to 1 indicates that the input cluster is more similar to that cell type in the Atlas.

You can add the predicted cell types to your Seurat object using the code below.

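library(Seurat)

# Read the CELLiD results table (saved here as spreadsheet.csv)
predicted.ct <- read.csv("spreadsheet.csv")
# as.numeric() on the seurat_clusters factor gives 1-based level positions,
# so this assumes one row per cluster, in the same order as the cluster levels
rna.data$primary.predict <- predicted.ct[as.numeric(rna.data$seurat_clusters), 2]
rna.data$secondary.predict <- predicted.ct[as.numeric(rna.data$seurat_clusters), 3]
DimPlot(rna.data, group.by = "primary.predict", label = TRUE)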
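The snippet above indexes the results table by position, which works when the table has exactly one row per cluster, sorted in the same order as the cluster levels. If that is not guaranteed, matching on the cluster ID is a more defensive option. The sketch below uses toy data and hypothetical column names (`input_id`, `primary`), so adjust them to the columns in your own results file.

```r
# Toy results table; column names are hypothetical, rows deliberately unsorted
predicted.ct <- data.frame(input_id = c(2, 0, 1),
                           primary = c("B cell", "T cell", "Monocyte"))
# Per-cell cluster labels, stored as a factor like seurat_clusters
clusters <- factor(c("0", "1", "2", "2", "0"), levels = c("0", "1", "2"))

# Look up each cell's cluster ID in the table instead of relying on row order
idx <- match(as.character(clusters), as.character(predicted.ct$input_id))
primary.predict <- as.character(predicted.ct$primary[idx])
primary.predict
```

With a Seurat object, `clusters` would be `rna.data$seurat_clusters` and the result could be assigned to `rna.data$primary.predict` as in the snippet above.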
