CellMapper
CellMapper projects your data onto our reference atlases. The results are visualized as a UMAP, allowing you to easily identify cell types common to both your input and the reference atlas.
CellMapper also checks whether the cell types in your input are found in the reference atlas, so you can quickly identify cell types specific to your sample. These may be novel cell types not currently captured in our atlases, or cell types unique to the perturbations or disease states in your study. The results can be downloaded, and these cells can be filtered out for downstream analyses such as differentially expressed gene (DEG) analysis.
CellMapper requires 2 inputs from the user:
Selection of a reference atlas
A Seurat Object with the normalized RNA assay data in the 'data' slot
Please note:
Only Seurat V4 is supported.
If your data comprises multiple samples, please either use batch-corrected values or use raw values for each sample individually. This prevents batch effects from confounding your results.
The maximum accepted file size is 1 GB. File size can be reduced with the following code:
TIP
You can explore this workflow using the provided example file, which contains AML0024_3p, AML0048_3p and AML006_3p: three bone marrow samples taken from patients with Acute Myeloid Leukemia (AML).
The user provides gene expression data and selects the reference atlas of interest.
Seurat's FindIntegrationAnchors is first used to find anchors between the reference atlas and all cells in your input. These anchors are then used to integrate the reference dataset with your input.
CAUTION
At this step, you might receive a low anchor rate warning, meaning that your data is too dissimilar to the reference atlas. This can indicate poor data quality or an inappropriate choice of reference atlas.
Principal Component Analysis (PCA) is used to identify the appropriate number of principal components in the integrated data, and these components are used as input to non-linear dimensional reduction (UMAP). The UMAPs for the reference atlas and the user data are shown side by side for easy identification of cell types unique to the user's samples.
Beyond the main UMAP showing the predicted cell types, CellMapper provides 2 metrics to help users identify cell types specific to their samples. The first is distance_to_reference, which visualizes the distance between the user input and the reference. For each cell in the user's sample, the distance is calculated as the average distance to its 5 nearest neighbours in the reference atlas. It is taken as a measure of how dissimilar the cell is from the reference, with a darker color signifying greater dissimilarity.
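The distance_to_reference calculation described above can be sketched in a few lines. This is a minimal illustration in plain Python, not CellMapper's own implementation: the function name, the toy coordinates, and the choice of Euclidean distance in a low-dimensional embedding are all assumptions, since the text does not state which space or metric is used.

```python
import math

def distance_to_reference(query_cells, reference_cells, k=5):
    """Mean Euclidean distance from each query cell to its k nearest reference cells.

    Assumption: cells are points in a shared embedding (for example PCA space).
    If fewer than k reference cells exist, all of them are used.
    """
    scores = []
    for q in query_cells:
        # Distances from this query cell to every reference cell, smallest first
        dists = sorted(math.dist(q, r) for r in reference_cells)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores

# Toy 2-D embedding (hypothetical values)
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
query = [(0.5, 0.5), (10.0, 10.0)]

scores = distance_to_reference(query, reference, k=5)
# The first query cell sits inside the reference cloud and scores low;
# the second is far away and scores high.
```

A brute-force search like this is only practical for small inputs. Real datasets would normally use an approximate nearest-neighbour index.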
The distances are used to generate a density plot, which usually resembles a mixture of Gaussian distributions. The Bayesian Information Criterion (BIC) is used for model selection to determine the appropriate number of Gaussian components that can be fitted, based on the distribution characteristics of the distances. A maximum of 2 components can be fitted, which would indicate distributions corresponding to a 'reference-similar' and a 'reference-dissimilar' group. mclust is then used to fit the Gaussian distributions and to provide the distribution parameters. A threshold distance, 8.88 in this example, is then determined from the intersection of these 2 distributions and is used to separate the 'reference-similar' (mapped) and 'reference-dissimilar' (unmapped) groups.
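To make the threshold step concrete, the sketch below finds the point between two fitted Gaussian components where their densities are equal. It is an illustration only: the function names and parameter values are hypothetical, and weighting each component by its mixing proportion (equal by default here) is an assumption, as the text does not say whether the intersection uses weighted or unweighted densities.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def intersection_threshold(m1, s1, m2, s2, w1=0.5, w2=0.5):
    """Return the x between the two means where w1*N(m1, s1) equals w2*N(m2, s2).

    Equating the log densities gives a quadratic a*x^2 + b*x + c = 0.
    Returns None if no crossing lies between the means.
    """
    a = 1 / (2 * s2**2) - 1 / (2 * s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = m2**2 / (2 * s2**2) - m1**2 / (2 * s1**2) + math.log((w1 * s2) / (w2 * s1))
    if abs(a) < 1e-12:
        # Equal variances: the equation is linear
        if abs(b) < 1e-12:
            return None
        roots = [-c / b]
    else:
        disc = b * b - 4 * a * c
        if disc < 0:
            return None
        r = math.sqrt(disc)
        roots = [(-b - r) / (2 * a), (-b + r) / (2 * a)]
    lo, hi = sorted((m1, m2))
    between = [x for x in roots if lo <= x <= hi]
    return between[0] if between else None

# Hypothetical component parameters: a tight 'reference-similar' group
# and a broader 'reference-dissimilar' group
threshold = intersection_threshold(0.0, 1.0, 10.0, 2.0)
```

With equal variances and equal weights the crossing is simply the midpoint of the two means. With unequal variances it shifts towards the narrower component.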
The second is mapping_result, which highlights cells whose distance is greater than the threshold distance. These cells are labelled as 'unmapped': the distance between them and the reference is large enough that they may be worth an in-depth look! For example, we can see that a large proportion of unmapped cells appear to be situated close to the CD14+ MHCII high monocyte population.
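The labelling rule above amounts to a single comparison per cell. A minimal sketch, assuming hypothetical distance values and using the example threshold of 8.88 (the function name is illustrative, not part of CellMapper):

```python
def label_cells(distances, threshold):
    """Label a cell 'unmapped' when its distance exceeds the threshold, else 'mapped'."""
    return ["unmapped" if d > threshold else "mapped" for d in distances]

# Hypothetical distance_to_reference values for five cells
distances = [2.1, 7.9, 8.88, 9.4, 15.0]
labels = label_cells(distances, threshold=8.88)
# A cell exactly at the threshold is treated as mapped in this sketch
```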
CellMapper allows the user to group their results according to any provided metadata, and displays the results as both a UMAP and a heatmap. In this example, the provided metadata groups the data into 'Malignant', 'microenvironment' and 'NA'.
By splitting across the metadata [tumor], we can see that the unmapped CD14+ MHCII high monocyte population appears to be largely concentrated in the malignant samples, with a group closer to the original CD14+ MHCII high monocyte population in the reference (boxed in green) and a distinctly separate group (boxed in red). We interpret this to suggest that malignant transformation might have given rise to different cell states in the former that are still similar to CD14+ MHCII high monocytes, or distinct cell types in the latter case.
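One way to quantify an observation like this is to tabulate the fraction of unmapped cells per cell type and metadata group. The sketch below is an assumption about how such a table could be built, not a description of how CellMapper computes its heatmap. The cell_type field name and the toy records are hypothetical; tumor and mapping_result follow the names used above.

```python
from collections import defaultdict

def unmapped_fraction_by_group(cells, group_key="tumor"):
    """Fraction of 'unmapped' cells for each (cell type, metadata group) pair."""
    totals = defaultdict(int)
    unmapped = defaultdict(int)
    for cell in cells:
        key = (cell["cell_type"], cell[group_key])
        totals[key] += 1
        if cell["mapping_result"] == "unmapped":
            unmapped[key] += 1
    return {key: unmapped[key] / totals[key] for key in totals}

# Hypothetical per-cell records
cells = [
    {"cell_type": "CD14+ MHCII high monocyte", "tumor": "Malignant", "mapping_result": "unmapped"},
    {"cell_type": "CD14+ MHCII high monocyte", "tumor": "Malignant", "mapping_result": "unmapped"},
    {"cell_type": "CD14+ MHCII high monocyte", "tumor": "Malignant", "mapping_result": "mapped"},
    {"cell_type": "CD14+ MHCII high monocyte", "tumor": "microenvironment", "mapping_result": "mapped"},
    {"cell_type": "CD14+ MHCII high monocyte", "tumor": "microenvironment", "mapping_result": "mapped"},
]

fractions = unmapped_fraction_by_group(cells)
```

A high fraction in one group and a low fraction in another points to where the unmapped cells are concentrated.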
TIP
The download tab allows you to download the Seurat object of the transferred data (.rds), the predicted cell types (.txt) and useful output plots for publication. These can be used for subsequent downstream analyses. As a simple example, we might want to perform a DEG analysis specifically on the CD14+ MHCII high monocyte population in the green box, since these are more likely to be monocytes whose expression profiles have been perturbed by malignancy, while filtering out those in the red box, which are likely to be different cell types entirely, hence their higher distance to reference. This can be done with the following code:
You might notice that this reference UMAP no longer looks the same as the original! This is because CellMapper first integrates the user input with the reference before clustering the data and generating a new UMAP. This integration step avoids biasing the UMAP towards the cell types already present in the reference, allowing sample-specific cell types in the user data to be identified.
To add metadata to your Seurat object, please refer to the guide.