--- title: "RaceID/StemID/VarID reference manual" date: "`r Sys.Date()`" author: "Dominic Grun" output: html_document: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{An introduction to RaceID and StemID.} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- \tableofcontents ```{r echo=FALSE} knitr::opts_chunk$set(fig.width=8, fig.height=8, dpi=50, dev='jpeg') ``` # Prerequisites This vignette provides a manual for the analysis of single-cell RNA-seq data with three different methods: 1. RaceID3 [@FateID]: pre-processing, clustering and outlier identification 2. StemID2 [@FateID]: lineage tree inference (dependend on a prior RaceID3 analysis) 3. VarID2 [@VarID2]: pre-processing, clustering, quantification of technical and biological noise, pseudotime analysis (optional integration with RaceID3) RaceID [@RaceID] (the current version is RaceID3 [@FateID]) is a method for cell type identification from single-cell RNA-seq data by unsupervised learning. An initial clustering is followed by outlier identification based on a backgorund model of combined technical and biological variability in single-cell RNA-seq data with unique molecular identifiers. StemID [@StemID] (the current version is StemID2 [@FateID]) permits inference of a lineage tree based on clusters identified by RaceID. The current version of RaceID (RaceID3) and StemID (StemID2) are published together with the FateID algorithm [@FateID]. The current version of the RaceID package implements additional improvements in rare cell type detection, offers batch effect removal utilities, optional imputing of gene expression, and substantially decreases runtime as well as memory usage. VarID [@VarID] (the current version is VarID2 [@VarID2]) is implemented as part of the RaceID package. VarID2 allows the quantification of local gene expression variability and the deconvolution into technical and biological noise components, and provides an alternative clustering approach based on community detection. VarID2 analysis is independent of RaceID but can be fully integrated into RaceID \code{SCseq} objects. RaceID is integrated with the FateID method for the prediction of cell fate probabilities [@FateID], which is available as a separate package. # RaceID ## Example RaceID analysis Input to RaceID is a matrix of raw expression values (unique molecular identifiers with or without Poisson correction [@Noise]) with cells as column and genes as rows. This matrix can be provided as a matrix object, a data.frame or a sparse matrix produced by the `Matrix` package. The `RaceID` package comes with sample data `intestinalData` for murine intestinal epethelial cells stored in sparse matrix format. The dataset was published previously [@StemID]. RaceID and StemID functions have various input and output parameters, which are explained in detail on the `help` pages. Here, we mostly use default parameters, which represent a good choice for common datasets. To start the analysis, a RaceID single-cell sequencing (`SCseq`) object is initialized with a count matrix. ```{r results='hide', message=FALSE} library(RaceID) sc <- SCseq(intestinalData) ``` The first step is the application of filtering for the purpose of quality control. Cells with a relatively low total number of transcripts are discarded. ```{r} sc <- filterdata(sc,mintotal=2000) ``` In this example, we filter out cells with <2,000 transcipts. The function allows further filtering of genes by choosing the input parameters `minexpr` and `minnumber`, i.e. only genes with at least `minexpr` transcripts in at least `minnumber` cells are retained. The filtered and normalized expression matrix (normalized to the minimum total transcript count across all cells retained after filtering) can be retrieved by the `getfdata` function: ```{r} fdata <- getfdata(sc) ``` The `filterdata` function allows the detection and removal of batch effects by different methods as outlined below in addtional examples. Alternatively, individual genes or groups of genes correlating to specific genes can be removed with the `FGenes` and `CGenes` arguments. This frequently allows a minimally invasive removal of batch effects or of particularly highly expressed genes with an unwanted dominating effect on the clustering. Next, the distance matrix is computed as the input for clustering and outlier identification. This can be done with or without imputing gene expression from nearest neighbours (see example below for imputing). RaceID offers various alternative metric, e.g. spearman, kendall, and euclidean, as well as measure of proportionality (rho and phi from the `propr` package). ```{r} sc <- compdist(sc,metric="pearson") ``` This function computes the distance matrix based on highly variable genes by default. If all genes should be used, then the parameter `FSelect` needs to be set to `FALSE`. On this distance matrix clustering is performed: ```{r results='hide', message=FALSE} sc <- clustexp(sc) ``` To infer the initial cluster number, this function computes the average within-cluster dispersion up to a number of clusters specified by the `clustnr` arguments, which equals 30 by default. If more major populations are expected, this parameter needs to be set to higher values which will increase the run time. The initial cluster number only serves as a rough estimate, since the subsequent outlier identification will infer additional clusters. The internal inference of the cluster number and the evaluation of cluster stability by the computation of Jaccard's similarity is done on all cells by default. For large datasets it is reasonable to subsample a limited number of cells, by setting the `samp` argument, e.g., to 1,000. In this case, the cluster number is inferred based on this smaller subset and Jacccard's similarity is not computed by bootstrapping but by sub-sampling. For k-medoids clustering, subsetting will imply almost deterministic clustering partitions, if the sample size approaches the size of the dataset. Therefore, `samp` should be signicantly smaller then the size of the dataset. Otherwise, bootstrapping is better for assessing the cluster stability. Subsampling can be expected to give a good estimate of the number of major clusters. Additional small clusters which might have been missed by the sampling can be reintroduces during the outlier identification step. The inferred cluster number can be inspected in a saturation plot, which shows the decrease of the average within-cluster dispersion with increasing cluster number. If this decrease becomes constant, saturation is reached. The automatically chosen cluster number is detected such that the decrease is equal to the decrease upon further increasing the cluster number within the error bars: ```{r} plotsaturation(sc,disp=FALSE) ``` The average within-cluster dispersion can also by plotted: ```{r} plotsaturation(sc,disp=TRUE) ``` The cluster stability as assessed by Jaccard's similarity should also be inspected: ```{r} plotjaccard(sc) ``` In this example, the automated criterion overestimated the cluster number leading to instability as indicated by low Jaccard's similarity. Based on visual inspection of the average within-cluster dispersion as a function of the cluster number, we manually set the cluster number to 7 without recomputing the saturation behaviour. ```{r results='hide', message=FALSE} sc <- clustexp(sc,cln=7,sat=FALSE) ``` This function perform k-medoids clustering by default. K-means clustering or hierarchical clustering can be chosen with the `FUNcluster` argument. For very large datasets, hierarchical clustering leads to significantly smaller run time. Subsequently, outliers in the initial k-medoids clusters are identified based on an internally computed background model for the expected gene expression variability and the assumption that transcript counts follow a negative binomial distribution defined by the mean and the variance of the expression of each gene in a cluster. Outlier genes will be in the tail of this distribution at a p-value defined by the `probthr` parameter (1e-3 by default), and outlier cells require the presence of a number of outlier genes defined by the `outlg` parameter (2 by default). ```{r results='hide', message=FALSE} sc <- findoutliers(sc) ``` In contrast to previous versions, outlier genes are inferred from non-normalized transcript counts, which follow a negative binomial distribution modeling the joint technical and biological variability. The assumption of a negative binomial distribution was demonstrated for raw transcript (UMI) count data, but is not strictly valid for normalized expression values [@Noise]. Hence, RaceID does not require normalization, since the correlation-based metric for the computation of the distance object is also independent of the normalization. Normalizaion is only relevant when using, e.g., the euclidean metric for the derivation of the distance matrix. RaceID applies a simple size normalization for data representation and follow-up analyses. The background noise model can be inspected: ```{r} plotbackground(sc) ``` The number of outliers as a function of the p-value can be plotted: ```{r} plotsensitivity(sc) ``` Another way of checking the presence of outliers is the inspection of a barplot of p-values across all cells in each cluster: ```{r} plotoutlierprobs(sc) ``` A heatmap of cell-to-cell distances grouped be the final clusters inferred based on the original clusters and the outliers allows visual inspection of the clusters: ```{r eval=FALSE} clustheatmap(sc) ``` This function is not recommended for very large datasets, since it produces similarly large plotting output. The best way of visualising the detetcted cell types is plotting cells and clusters in a two-dimensional reduction representaion. RaceID can compute a t-SNE map ```{r} sc <- comptsne(sc) ``` or a k-nearest neighbour graph layout utilizing the Fruchterman-Rheingold algorithm: ```{r} sc <- compfr(sc,knn=10) ``` In this example, the number of nearest neighbours was chosen to be 10. In general, different values for `knn` should be tested to find an ideal layout. RaceID further allows computation of a umap dimensional reduction representation: ```{r} sc <- compumap(sc) ``` Parameters for umap can be given as additional argument (see help function). The t-SNE map can be plotted by ```{r} plotmap(sc) ``` The Fruchterman-Rheingold layout can be plotted by ```{r eval=FALSE} plotmap(sc,fr=TRUE) ``` The umap representation can be plotted by ```{r eval=FALSE} plotmap(sc,um=TRUE) ``` Maps can be changed for both t-SNE and Fruchterman-Rheingold algorithm by initializing the `rseed` argument of the `comptsne` or `compfr` function with a random number. The dimensional reduction maps can be inspected, e.g., for the localization of (a subset of) samples included in the analysis: ```{r} types <- sub("(\\_\\d+)$","", colnames(sc@ndata)) subset <- types[grep("IV|V",types)] plotsymbolsmap(sc,types,subset=subset,cex=1) ``` Expression of a gene of interest or aggregated expression for a group of genes can be highlighted in the dimensional reduction representation: ```{r} plotexpmap(sc,"Lyz1",logsc=TRUE,cex=1) g <- c("Apoa1", "Apoa1bp", "Apoa2", "Apoa4", "Apoa5") plotexpmap(sc,g,n="Apoa genes",logsc=TRUE,cex=1) ``` It is also possible to highight expression only for a subset of cells, e.g. a particular batch or sample: ```{r} sample <- colnames(sc@ndata)[grep("^I5d",colnames(sc@ndata))] plotexpmap(sc,"Lyz1",cells=sample,logsc=TRUE,cex=1) ``` For the murine intestinal example data, inspetion of known marker genes suggests that cluster 2 and 3 correspnd to Lgr5-expressing intestinal stem cells. Cluster 2 is proliferative as indicated by up-regulation of Mki67, Cluster 1 comprises transiently amplifying progenitors biased towards absorptive entorytes in cluster 4 marked by Apolipoproteins. Cluster 7 comprises Lysozyme-expressing Paneth cells while Mucus-producing goblet cells constitute clusters 6 and 10. With the help of known marker genes, e.g., Chgb for enteroendocrine cells, Muc2, Agr2, and Clca3 for goblet cells, and Defa20 for Paneth cells, rare cell types among the detected RaceID clusters can be annotated: ```{r} genes <- c("Lyz1","Defa20","Agr2","Clca3","Muc2","Chgb","Neurog3","Apoa1","Aldob","Lgr5","Clca4","Mki67","Pcna") plotmarkergenes(sc,genes=genes) ``` To inspect clusters in more detail, differentially expressed genes can be inferred by an internal approach akin to DESeq2 [@DESeq2] but with a dispersion estimated globally from the background model of RaceID. For instance, to obtain differentially expressed genes within cluster 9 compared to all other cells: ```{r} d <- clustdiffgenes(sc,4,pvalue=.01) dg <- d$dg head(dg,25) ``` The differentially expressed genes (in this example only the up-regulated ones with a fold change >1) can be plottted in a heatmap, which can highlight the clusters and samples of origin: ```{r} types <- sub("(\\_\\d+)$","", colnames(sc@ndata)) genes <- head(rownames(dg)[dg$fc>1],10) plotmarkergenes(sc,genes,samples=types) ``` The heatmap can also be ordered by cell names (i.e. by batch or sample) by setting `order.cells` to `TRUE`. With the input parameters `cl` and `cells`, the heatmap can be restricted to a subset of clusters or cells, respectively. ```{r eval=FALSE} plotmarkergenes(sc,genes,cl=c(2,3,1,4),samples=types,order.cells=TRUE) ``` In this example, no inter-sample differences are apparent and all samples contribute to each cluster. It is also possible to plot gene expression across clusters or samples using a dot plot, where the diameter represents the fraction of cells in a sample or cluster expressing a gene, and the color indicates the expression level or z-score. For example, the following commmand plots expression z-scores across clusters 2, 3, 1 and 4: ```{r} fractDotPlot(sc, genes, cluster=c(2,3,1,4), zsc=TRUE) ``` Expression on a log2 scale across samples I, II, and III can be plotted by ```{r} samples <- sub("(\\d.+)$","", colnames(sc@ndata)) fractDotPlot(sc, genes, samples=samples, subset=c("I","II","III"), logscale=TRUE) ``` A differential gene expression analysis between two defined sets of cell, e.g., two (groups of) clusters can be performed: ```{r} A <- names(sc@cpart)[sc@cpart %in% c(2,3)] B <- names(sc@cpart)[sc@cpart %in% c(4)] x <- diffexpnb(getfdata(sc,n=c(A,B)), A=A, B=B ) plotdiffgenesnb(x,pthr=.05,lthr=.5,mthr=-1,Aname="Cl.2,3",Bname="Cl.4",show_names=TRUE,padj=TRUE) ``` See the paragraphs below for additional options of RaceID analyses and parameter choices ideal for analysing large datasets. # StemID ## Example StemID analysis: projection mode StemID is an algorithm for the inference of lineage trees and differentiation trajectories based on pseudo-temporal ordering of single-cell transcriptomes. It utilizies the clusters predicted by RaceID and thus requires to run RaceID first. The algorithm was originally published along with RaceID2 [@StemID] and the improved current version StemID2 was published later [@FateID]. In a nutshell, StemID infers links between clusters which are more populated with intermediate single-cell transcriptomes than expected by chance. To assign cells to inter-cluster links, two fundamentally different strategies are available (see `nmode` argument below). The first strategy considers the projection of a vector connecting a cluster medoid to a member cell of the same cluster onto the links from the medoid of its cluster to the medoids of all other clusters. The longest projection identifies the link this cell is assigned to and defines the projection coordinate. The second (nearest neighbour) mode identifies for a cell in a given cluster the number of k nearest neighbours in each other cluster and assigns the cell to the link with the cluster where the average distance to these k nearest neighbours is minimized. The coordinate on the link is inferred as in the first mode. A faster approximate version of the first mode is also implemented. As a first step, a lineage tree object for the StemID analysis needs to be initialized with an SCseq object obtained from a RaceID analysis: ```{r} ltr <- Ltree(sc) ``` Next, the transcriptome entropy of cell needs to be calculated. This is used by the StemID algorithm for the prediction of the stem cell type, based on maximum transcriptome entropy and maximum number of links to other clusters. ```{r} ltr <- compentropy(ltr) ``` In the subsequent step, cells are projected onto inter-cluster links. Cells are assigned to a link based on minimum distance to k nearest neighbours (`nmode=TRUE`) or based on the maximum projection coordinate (`nmode=FALSE`). Only clusters with `>cthr` cells are included in the analysis. If `fr=TRUE` then the Fruchterman-Rheingold layout will be used for representation of the inferred lineage tree, and if `um=TRUE` a UMAP representation will be used. Otherwise, representation will be done in t-SNE space. The computation of the lineage graph is independent of the dimensional reduction method which is only used for visualization. ```{r} ltr <- projcells(ltr,cthr=5,nmode=FALSE) ``` If projections are used for link determenation (`nmode=FALSE`), the derivation of link significance is done by comparing to the link population after randomizing cell positions within the boundaries imposed by the gene expression manifold. This is done by bootstrapping using 500 randomizations by default. More randomizations are possible, but obviously linearly increase runtime. ```{r results='hide', message=FALSE} ltr <- projback(ltr,pdishuf=100) ``` Based on this information, a lineage graph is computed to approximate the lineage tree (a tree structure is not strictly imposed). ```{r results='hide', message=FALSE} ltr <- lineagegraph(ltr) ``` Finally, link p-values are computed and a threshold `pthr` is applied on the p-values: ```{r} ltr <- comppvalue(ltr,pthr=0.1) ``` The resulting graph can be plotted, overlaid with a dimensional reduction representation (Fruchterman-Rheingold or t-SNE, see `projcells`). To retain only the more populated links, a cutoff `scthr` on the linkscore can be applied, e.g. 0.2: ```{r} plotgraph(ltr,scthr=0.2,showCells=FALSE,showMap=TRUE) ``` To predict the stem cell, the StemID score can be computed and visualized: ```{r} x <- compscore(ltr,scthr=0.2) ``` StemID offers a number of plotting functions to inspect the results. RaceID performs clustering using Pearson's correlation as a metric by default. The StemID projections require a Euclidean space and thus an initial embedding into a high-dimensional space is performed by classical multidimensional scaling. To inspect how well cell-to-cell distances are preserved, a histogram of the log-ratios between the original and transformed distances can be plotted: ```{r} plotdistanceratio(ltr) ``` The StemID prediction can be compared to a minimal spanning tree of the cluster medoids: ```{r} plotspantree(ltr) ``` The cell projections onto the links can be directly compared with this minimal spanning tree: ```{r} plotspantree(ltr,projections=TRUE) ``` All linkscores and fold enrichments of the population on a link can be plotted as heatmaps: ```{r} plotlinkscore(ltr) projenrichment(ltr) ``` The (z-scores of the) projections of all cells from a given cluster across all links can be plotted as heatmap, e.g. for cluster 3: ```{r} x <- getproj(ltr,i=3) ``` All cells on two different branches originating from the same cluster, e.g. cluster 3 cells on the links to cluster 1 and 8, can be extracted for the computation of differentially expressed genes: ```{r} x <- branchcells(ltr,list("1.3","3.8")) head(x$diffgenes$z) ``` The cells on the two branches can be plotted as additional clusters in the dimensional reduction representation: ```{r} plotmap(x$scl,fr=TRUE) ``` ## Example StemID analysis: nearest-neighbour mode Since the randomizations of cell positions for the derivation of link significance require long computation time, and the projection-based method leads to some weak links which are potentially false positives (and can be filtered out based on linkscore), the nearest-neighbour-based method has now been selected to be the default mode of StemID. This method is more robust and fast even on large datasets. The downside is that it will miss some weak links, i.e. lead to more false negatives in comparison to the projection mode. First, a lineage tree object needs to be initialized followed by the calculation of the transcriptome entropy of each cell. ```{r} ltr <- Ltree(sc) ltr <- compentropy(ltr) ``` Next, cell projection are calculated with the parameter `nmode=TRUE`, which is also the default value: ```{r} ltr <- projcells(ltr,cthr=5,nmode=TRUE,knn=3) ``` The `knn` parameter determines how many nearest neighbours are considered in each cluster for determining the link assignment: the distance two each cluster is calculated as the average across the distance of a cell to the `knn` nearest neighbours within each other cluster, and the cell is assigned to the link with the cluster minimizing this distance. Now, the lineage tree is inferred and the p-values for the links are calculated based on a binomial model: ```{r results='hide', message=FALSE} ltr <- lineagegraph(ltr) ltr <- comppvalue(ltr,pthr=0.05) ``` The resulting lineage graph can be inspected and reveals the expected trajectories connecting the stem cells (cluster 2 and 3 of cycling and quiescent cells, respectively) to enterocytes (cluster 4) via transiently amplifying progenitors (cluster 1), to Paneth cells (cluster 7), and to goblet cells (cluster 6). The StemID score suggests stem cell identity for clusters 2 and 3: ```{r} plotgraph(ltr,showCells=FALSE,showMap=TRUE) x <- compscore(ltr) ``` # RaceID/StemID Options ## Batch effect removal RaceID offers the possibility of batch correction utilizing an internal method or the published `mnnCorrect` function from the `batchelor` package [@mnnCorrect]. The `batchelor` package needs to be separately installed from bioconductor if this mode is desired. In order to apply batch effect removal, a list with a vector of cell ids for each batch needs to be defined, e.g.: ```{r eval=FALSE} n <- colnames(intestinalData) b <- list(n[grep("^I5",n)],n[grep("^II5",n)],n[grep("^III5",n)],n[grep("^IV5",n)],n[grep("^V5",n)]) ``` This list is provided as input to the `filterdata` function, and the `bmode` argument is initialized with the desired method, i.e. `mnnCorrect` or `RaceID`. The latter method simply compares the local neigbourhood, i.e. the set of k nearest neighbours, for each cell between two batches and identifies the neighbourhood of the two batches with the smallest average distance. A differential gene expression analysis between the closest neighbourhoods of two batches yields batch associated genes. The next batch is then compared in the same way to the merged dataset of the previous batches. Batches are compared and successively merged according to the order they are provided in `b`. An additional input parameter `knn` controls the number of nearest neighbours, i.e. the size of the neighbourhood. ```{r eval=FALSE} sc <- SCseq(intestinalData) sc <- filterdata(sc,mintotal=2000,LBatch=b,bmode="RaceID",knn=10) ``` The `filterdata` function will identify all batch-associated genes, which are stored in the `filterpar$BGenes` slot of the `SCseq` object. All genes that correlate to a batch gene are removed for the computation of a distance object. This is a minimally invasive strategy in comparison to `mnnCorrect`, which works well if batches are very similar, such as datasets produced from the same material using the same single-cell RNA-seq technology. ## Imputing of gene expression RaceID also offers optional imputing of gene expression, which can be useful if gene expression differences between cell types or states are governed only by lowly expressed genes and are difficult to resolve by clustering based on raw counts. If imputing is desired, the `knn` argument needs to be initialized with the number of nearest neighbours used for imputing gene expression : ```{r results='hide', message=FALSE, eval=FALSE} sc <- compdist(sc,knn=5,metric="pearson") ``` Now, for each cell the `knn` nearest neighbours are used to infer a local negative binomial for each gene, utilizing a weighted mean expression and the internal RaceID noise model to obtain the corresponding negative binomial. The weights are derived by quadratic programming, computing the expression vector of a cell as a linear combination of its `knn` nearest neighbours. The cell itself contributes with the same weight as the aggregated weights of the nearest neighbours to the weighted mean expression. With the help of this negative binomial the tail probability of each gene is derived across all `knn` nearest neighbours. The geometric means of these tail probabilities are finally applied as a weights for each nearest neighbours in the calculation of the imputed gene expression. This strategy ensures that gene expression can only be learned from nearest neighbours following the same transcript count distributions. After this, all steps remain the same. Imputing often helps to improve cluster discrimination and stability. Importantly, distances derived from imputed gene expression are only used for clustering. The outlier identification relies on unimputed gene expression, and hence can correct spurious clusters produced from imputed values. ```{r results='hide', message=FALSE, eval=FALSE} sc <- clustexp(sc) sc <- findoutliers(sc) sc <- compfr(sc) sc <- comptsne(sc) plotmap(sc,fr=TRUE) ``` If batch effect removal has been applied, the remaining batch effect can be checked by plotting symbols representig the sample of origin: ```{r eval=FALSE} types <- sub("(\\_\\d+)$","", colnames(sc@ndata)) plotsymbolsmap(sc,types,fr=TRUE) ``` Ideally, all samples should intermingle in each clusters. Imputed gene expression can be plotted by setting the `imputed` argument to `TRUE`. Otherwise, unimputed values are shown. ```{r eval=FALSE} plotexpmap(sc,"Mki67",imputed=TRUE,fr=TRUE) plotmarkergenes(sc,c("Clca4","Mki67","Defa24","Defa20","Agr2","Apoa1"),imputed=TRUE,samples=types) ``` An expression matrix with imputed expression values can be extracted for further analysis: ```{r eval=FALSE} k <- imputeexp(sc) ``` ## Removing variability by regression RaceID can also regress out variability associated with particular sources such as batches or cell cycle. If batch effect removal has been done by the `filterdata` function with `bmode="RaceID"` then this function can further regress out residual variability remaining after batch associated genes have been discarded for the distance computation. In the case, the argument `Batch` has to be set to `TRUE` and `vars` can be left empty if no further sources of variability should be regressed out. Batch effects can also be regressed out directly without prior removal using the `filterdata` function: ```{r results='hide', message=FALSE, eval=FALSE} sc <- SCseq(intestinalData) sc <- filterdata(sc,mintotal=2000) vars <- data.frame(row.names=colnames(intestinalData),batch=sub("(\\_\\d+)$","",colnames(intestinalData))) sc <- varRegression(sc,vars) sc <- compdist(sc,metric="pearson") sc <- clustexp(sc) sc <- findoutliers(sc) sc <- comptsne(sc) sc <- compfr(sc) plotmap(sc) ``` However, regression also leads to loss of biological variation and this step is only recommended if variability associated with a particular variable is a strong confounding factor and cannot be removed by other means. ## Prior dimensional reduction and removal of variability by PCA/ICA After running the `filterdata` function, a prior dimensional reduction using PCA or ICA can be performed using the `CCcorrect` function. This function can also be provided with a list `vset` of sets of genes, and principal components with loadings enriched in any of these sets will be discarded. Another options is to provide a list `CGenes` of genes, and sets to be tested for enrichment in each component are derived as the groups of all genes correlating to a component in FGenes. The `CCcorrect` function predicts a dimension for the prior dimensional reduction based on an ellbow function of the explained variability as a function of the number of components. This can be inspected by the `plotdimsat` function and manually adjusted: ```{r results='hide', message=FALSE, eval=FALSE} sc <- SCseq(intestinalData) sc <- filterdata(sc,mintotal=2000) sc <- CCcorrect(sc,dimR=TRUE) plotdimsat(sc) plotdimsat(sc,change=FALSE) sc <- filterdata(sc,mintotal=2000) sc <- CCcorrect(sc,nComp=9) sc <- compdist(sc,metric="pearson") sc <- clustexp(sc) sc <- findoutliers(sc) sc <- comptsne(sc) sc <- compfr(sc) plotmap(sc) ``` Rerunning `CCcorrect` requires to run `filterdata` first, because otherwise the dimensional reduction scores in the `dimRed` slot will be subject to a second dimensional reduction, which is not desired. All sub-sequent steps remain unaltered. ## Inferring stable clusters by random forests analysis RaceID provides the option to run a random forests based reclassifiction, in order to obtain a stable clustering partition. This can be done on the final clustering after running `findoutliers`: ```{r results='hide', message=FALSE, eval=FALSE} sc <- SCseq(intestinalData) sc <- filterdata(sc,mintotal=2000) sc <- compdist(sc,metric="pearson") sc <- clustexp(sc) sc <- findoutliers(sc) sc <- rfcorrect(sc) sc <- comptsne(sc) sc <- compfr(sc) plotmap(sc) ``` However, this is normally not required, since the improved outlier detection of the current version leads to stable clusters which do not change substantially after applying this function. Running `rfcorrect` takes very long for large datasets and should be omitted in this case. ## Parameters: Large Datasets For large dataset (>25k cells), we recommend using the VarID2 approach (see below) for density based clustering which does not require computation and storage of a huge distance matrix. For datasets below this size we recommend running RaceID/StemID with specific parameter settings. Since the correlation metric does not require dataset normalization, this approach has certained advantages compared to VarID2. However, the latter can also be run with a correlation matrix. In the following, we discuss a few paramters critical for the runtime of RaceID/StemID on large datasets. ```{r results='hide', message=FALSE, eval=FALSE} sc <- SCseq(intestinalData) sc <- filterdata(sc,mintotal=2000) sc <- compdist(sc) ``` Preferentially, clustering should be done by `FUNcluster="kmedoids"` but `"hclust"` often gives similar results and is significantly faster. ```{r results='hide', message=FALSE, eval=FALSE} sc <- clustexp(sc,samp=1000,FUNcluster="hclust") ``` For the determination of the cluster number and the inference of Jaccard's similarity, sub-sampling should be applied by setting the subset size `samp` to an integer number. A good choice could be 10-25% of the total number of cells. For large datasets this should be sufficient to discriminate the most abundant cluster, and additional smaller clusters will automatically be identified in the next step using the `findoutliers` function. It is important that the `clustnr` argument is much larger then the expected number of clusters. If stability analysis is not wanted, one can set `bootnr=1`. If clustering is re-run with a specific cluster number, e.g. `cln=12`, then the saturation criterion should be disabled by setting `sat=FALSE`. If the cluster granularity is too fine or too coarse, the `probthr` argment can be decreased or increased, respectively, e.g.: ```{r results='hide', message=FALSE, eval=FALSE} sc <- findoutliers(sc,probthr=1e-4) ``` It is adviced to change the `perplexity` argument to larger values, when computing the t-SNE map for large datasets, e.g. `perplexity=200`. However, a large perplexity will return an error for small datasets. ```{r results='hide', message=FALSE, eval=FALSE} sc <- comptsne(sc,perplexity=100) plotmap(sc) ``` The Fruchterman-Rheingold layout critically depends on the number `knn` of nearest neighbours, and different values should be tested: ```{r results='hide', message=FALSE, eval=FALSE} sc <- compfr(sc,knn=10) plotmap(sc,fr=TRUE) ``` For a sub-sequent StemID analysis of very large datasets, the nearest-neighbour mode (see above) will help to reduce the runtime. ## Inspecting pseudo-temporal gene expression changes To inspect pseudotemporal expression profiles, functions provided by the `FateID` package can be utilized. First, the trajectory needs to be defined based on a sequence of clusters. This sequence should ideally correspond to a trajectory predicted by StemID2, i.e. the clusters should be connected by a series of significant links. However, non-linked clusters can also be included. A pseudo-temporally ordered vector of cells along a StemID trajectory can be extracted with the `cellsfromtree` function from an `Ltree` object: ```{r} n <- cellsfromtree(ltr,c(2,1,4)) ``` The list `n` contains a vector `n$f` of cell ids ordered by their projection coordinate along the trajectory reflecting differentiation pseudotime. This vector and a filtered or unfiltered expression matrix can be used as input for pseudo-temporal gene expression analysis. The filtered expression data used for RaceID3 can be extracted with the `getfdata` function: ```{r} x <- getfdata(ltr@sc) ``` Additional filtering and subsetting of the gene expression matrix for cells on the trajectory, `n$f`, is done in the next step, utilizing functions from the `FateID` package: ```{r results='hide', message=FALSE, warnings=FALSE} library(FateID) fs <- filterset(x,n=n$f) ``` The `filterset` function can be used to eliminate lowly expressed genes on the trajectory from the subsequent analysis and has two additional arguments to discard genes, which are not expressed at a level of `minexpr` or higher in at least `minnumber` of cells. The function returns a filtered expression data frame with genes as rows and cells as columns in the same order as in the input vector `n`. In the next step, a self-organizing map (SOM) of the pseudo-temporal expression profiles is computed: ```{r} s1d <- getsom(fs,nb=100,alpha=.5) ``` This map provides a grouping of similar expression profiles into modules. The first input argument is again an expression data frame. In this case, we use the filtered expression table generated by the filterset function to retain only genes that are expressed on the trajectory under consideration. Pseudo-temporal expression profiles along the differentiation trajectory of interest are computed after smoothing by local regression with smoothing parameter `alpha`. This function returns a list of the following three components, i. e. a som object returned by the function `som` of the package `som`, a data frame `x` with smoothed and normalized expression profiles, and a data frame `zs` of z-score transformed pseudo-temporal expression profiles. The SOM is then processed by another function to group the nodes of the SOM into larger modules and to produce additional z-score transformed and binned expression data frames for display: ```{r} ps <- procsom(s1d,corthr=.85,minsom=3) ``` The first input argument is given by the SOM computed by the function `getsom`. The function has two additional input parameters to control the grouping of the SOM nodes into larger modules. The parameter `corthr` defines a correlation threshold. If the correlation of the z-scores of the average normalized pseudo-temporal expression profiles of neighboring nodes in the SOM exceeds this threshold, genes of the neighboring nodes are merged into a larger module. Only modules with at least `minsom` genes are kept. The function returns a list of various data frames with normalized, z-score-transformed, or binned expression along with the assignment of genes to modules of the SOM (see man pages for details). The output of the processed SOM can be plotted using the plotheatmap function. First, in order to highlight the clustering partition `y` the same color scheme as in the `SCseq` object can be used: ```{r} y <- ltr@sc@cpart[n$f] fcol <- ltr@sc@fcol ``` Now, the different output data frames of the `procsom` function can be plotted. Plot average z-score for all modules derived from the SOM: ```{r eval=FALSE} plotheatmap(ps$nodes.z,xpart=y,xcol=fcol,ypart=unique(ps$nodes),xgrid=FALSE,ygrid=TRUE,xlab=FALSE) ``` Plot z-score profile of each gene ordered by SOM modules: ```{r} plotheatmap(ps$all.z,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE) ``` Plot normalized expression profile of each gene ordered by SOM modules: ```{r eval=FALSE} plotheatmap(ps$all.e,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE) ``` Plot binarized expression profile of each gene (z-score < -1, -1 < z-score < 1, z-score > 1): ```{r eval=FALSE} plotheatmap(ps$all.b,xpart=y,xcol=fcol,ypart=ps$nodes,xgrid=FALSE,ygrid=TRUE,xlab=FALSE) ``` In order to inspect genes within individual modules of the SOM, these genes can be extracted given the number of the module. The module numbers are contained in the return value `nodes` of the `procsom` function and can be extracted, e. g. for module number 24: ```{r} g <- names(ps$nodes)[ps$nodes == 24] ``` The average pseudo-temporal expression profile of this group can be plotted by the function `plotexpression`: ```{r} plotexpression(fs,y,g,n$f,col=fcol,name="Node 24",cluster=FALSE,alpha=.5,types=NULL) ``` In the same way it is possible to plot expression profiles of individual genes, e.g.: ```{r} plotexpression(fs,y,"Clca4",n$f,col=fcol,cluster=FALSE,alpha=.5,types=NULL) ``` It is also possible to highlight the data points as specific symbols, for example reflecting batches, by using the types argument: ```{r} plotexpression(fs,y,g,n$f,col=fcol,name="Node 24",cluster=FALSE,alpha=.5,types=sub("\\_\\d+","",n$f)) ``` # VarID2 ## Example VarID2 analysis VarID2 is a novel algorithm utilizing a k nearest neighbour graph derived from transcriptome similarities [@VarID2]. The key idea is the inference of locally homogenous neighbourhoods of cells in cell state space, i.e., all k nearest neighbours follow the same transcript count distributions as the cell to which they are nearest neighbours ("central" cell). The computation starts with k nearest neighbours and infers a negative binomial distribution for the transcript counts of each gene, derived from the mean-variance dependence computed locally across a central cell and its k nearest neighbours. The mean expression is derived from a weighted average across the k nearest neighbours, with weights inferred using quadratic programming, representing the central cell as a linear combination of its neighbours. The contribution of the central cell itself can be controlled by a tunable scaling parameter `alpha`. Typical values of `alpha` should be in the range of 0 to 10 (default 1). If `alpha` is `NULL`, then it is inferred by a local optimization, i.e., it is minimized under the constraint that the gene expression in the central cell does not deviate more then one standard deviation from the predicted weigthed mean across its k nearest neighbours, where the standard deviation is calculated from the predicted mean using the local background model (the average dependence of the variance on the mean expression). With the negative binomial distribution inferred from the local means and mean-variance dependence, the probabilities of observed transcript counts for all genes can be computed in each neighbour, and, after multiple testing correction, the geometric mean across the `nb` (deafult 3) genes with the lowest probability serves as a proxy for the link probability (i.e., the probability of neighbours being in the same state). If this probability falls below a user-defined threshold, the link to this neighbour is pruned After this procedure is applied to the nearest neighbours of each cell, a pruned k nearest neighbour graph with homogenous neighbourhoods in cell state space is obtained, in which the remaining links indicate that neighbours follow the same transcript count distribution for all genes. These local neighbourhoods can be used to compute local properties such as gene expression variability, which in turn can be correlated to cell state or predicted fate, derived, e.g., from FateID analysis, respectively. Although VarID2 analysis can be performed independently of the RaceID data object, an integration with a RaceID analysis permits convenient visualization and processing of the results. For this reason, we here demonstrate an integrated pipeline, but explain at each step, how computation can be performed given minimal input data, i.e., a transcript count matrix. The VarID2 method requires unique molecular identifier (UMI) counts, which have been shown to follow a negative binomial distribution, in order for the background model to be valid. Given such a UMI matrix with cells as columns and genes as rows, we could first initialize a RaceID object, and perform cell and gene filtering to retain only expressed genes and high quality cells. We here demonstrate the functionalities of VarID2 on our single-cell transcriptome data of intestinal epithelial cells [@StemID]. ```{r} sc <- SCseq(intestinalData) sc <- filterdata(sc,mintotal=1000,FGenes=grep("^Gm\\d",rownames(intestinalData),value=TRUE),CGenes=grep("^(mt|Rp(l|s))",rownames(intestinalData),value=TRUE)) ``` In this example, all cells with a raw UMI count <1,000 are discarded, and all gene IDs starting with Gm (genes without official gene symbol, uncharacterized loci) are removed. Moreover, all mitochondrial and ribosomal sub-unit encoding genes as well as all genes correlating with these classes are filtered out. Although VarID2 analysis can be performed with any given distance matrix, the preferred (default) mode does not require a distance matrix, and relies on k-nearest neighbor analysis in PCA space. For large datasets, the computation of a distance matrix is prohibitive due to a memory requirement scaling with N^2 if N is the number of cells. In this case, the `pruneKnn` function can be executed with `large=TRUE` (default setting). Here, the input expression matrix is first optionally corrected for variability associated with total transcript count per cell by a negative binomial regression (if `regNB=TRUE` as by default; if `regNB=FALSE`, transcript counts in each cell are normalized to one, multipled by the minimal total transcript count across all cells, followed by adding a pseudocount of 0.1 and taking the logarithm). The mean dependence of the regression parameters is smoothed, and Pearson residual are computed. The matrix of Pearson residuals (default setting), or, alternatively, the raw count matrix then undergoes a dimensional reduction by a principle component analysis (PCA), including the top `pcaComp` principle components (PCs). The number of top PCs used for dimensional reduction is determined by an elbow criterion (requiring that the change in explained variability upon further increasing the number of PCs has become linear). However, the number of PCs can also be adjusted manually (parameter `pcaComp`). This reduced PC matrix is subsequently used for a fast k nearest neighbour search by the `get.knn` function from the `FNN` package based on euclidean distances. The `algorithm` parameter of the `pruneKnn` function enables the selection of the search algorithm to apply. In general, it is recommended to run VarID2 with `large=TRUE` and `regNB=TRUE` (default settings). A RaceID object can be utilized to facilitate the analysis, but in this mode the `compdist` function does not need to be executed: ```{r} expData <- getExpData(sc) res <- pruneKnn(expData,no_cores=1) ``` The `pruneKnn` function allows normalization by a full negative binomial regression (akin to [@sctransform]), or, alternatively, by analyical Pearson residuals as proposed by Lause et al. [@offSetModel] (default). In the latter case, the dispersion parameter can either be inferred by maximum likelihood, or approximated as $\theta$=10 (default). The function has a number of optional parameters controlling the outcome of the this step. For instance, `genes` allows limiting the analysis to a subset of genes, `knn` defines the number of nearest neighbours, `alpha` is the scale parameter for the central cell, and `FSelect` allows internal feature selection. The background model is calculated internally based on the local mean-variance relation in the same way as done in RaceID and feature selection follows the same logic. The function further permits multithreading and the number of cores can be defines using the `no_cores` argument. If this is set to `NULL`, the function detects the number of available cores and uses this number minus two to leave resources for other tasks running on the same machine. Calling `help` for this function explains all parameters in detail (as for all other functions). The coefficients of the negative binomial regression (or the coefficients of the analytical model, if `offsetModel=TRUE`) can be inspected. Intercept: ```{r eval=FALSE} plotRegNB(expData,res,"(Intercept)") ``` Coefficient $\beta$ of total log UMI count: ```{r eval=FALSE} plotRegNB(expData,res,"beta") ``` Dispersion parameter $\theta$: ```{r eval=FALSE} plotRegNB(expData,res,"theta") ``` In this example, the analytical model was used (no regression was calculated) and $\theta$=10 was used [@offSetModel]. To check if the mean-dependence of the variance has been successfully removed by the normalization, the Pearson residuals can be plotted as a function of the mean: ```{r} plotPearsonRes(res,log=TRUE,xlim=c(-.1,.2)) ``` The number of PCs selected based on the elbow criterion can be inspected by plotting the percentage variability explained by each principle component: ```{r} plotPC(res) ``` Alternatively, the difference between the percentage variability explained by PC i and PC i + 1 can be plotted on a logarithmic scale: ```{r} plotPC(res,logDiff=TRUE) ``` The pruned k nearest neighbour information can be used as input for graph inference, Louvain clustering, and the derivation of a Fruchterman-Rheingold graph layout: ```{r} cl <- graphCluster(res,pvalue=0.01) ``` Alternatively, community detection can be performed with the Leiden algorithm at variying resolution. However, the leiden package requires preinstallation of python libraries (https://cran.r-project.org/package=leiden). This can be done, e.g., by using python reticulate. ```{r eval=FALSE} install.packages("reticulate") reticulate::use_python("/usr/bin/python3", required=TRUE) #confirm that leiden, igraph, and python are available (should return TRUE). reticulate::py_module_available("leidenalg") && reticulate::py_module_available("igraph") reticulate::py_available() ``` With this `graphCluster` can be run using leiden clustering at defined resolution: ```{r eval=FALSE} cl <- graphCluster(res,pvalue=0.01,use.leiden=TRUE,leiden.resolution=1.5) ``` The function returns a list of three components: a vector `partition` with a clustering partition, a data.frame `fr` containing a Fruchterman-Rheingold layout computed from the pruned k nearest neighbour graph, and an integer number `residual.cluster`. If clusters with less than `min.size` (input parameter to `graphCluster`) were obtained, all these clusters are aggregated into the cluster with the largest cluster number in the partition. If this aggregation was done, the cluster nuber is stored in `residual.cluster` (which is `NULL` otherwise). By setting the `use.weights` argument of `graphCluster` to `TRUE` (default), clustering can be performed on the pruned knn graph with weighted links, where weights correspond to the link probabilities. This may lead to the emergence of too many clusters (which could happen, e.g., due to variability in the link probabilities when integrating different experimental batches). In this case, the `use.weights` parameter can be set to `FALSE`, and clustering is performed on a pruned knn graph with equal weights for all links. The `pvalue` parameter determines the probability cutoff for link pruning, i.e., links with probabilities lower than `pvalue` are removed. To visualize the clustering with the help of RaceID, the `SCseq` object can be updated with the results: ```{r} sc <- updateSC(sc,res=res,cl=cl) ``` This function call will initialize the `cpart` and the `cluster$kpart` slots with the clustering partition in `cl$partition`, and the `fr` slot with the Fruchterman-Rheingold layout. The clustering can readily be inspected in the Fruchterman-Rheingold layout: ```{r} plotmap(sc,fr=TRUE) ``` After computing a t-SNE map, the clustering can be also be visualized in t-SNE space: ```{r eval=FALSE} sc <- comptsne(sc,perplexity=50) plotmap(sc) ``` Alternatively, a umap representation can be computed for visualization: ```{r} sc <- compumap(sc,min_dist=0.5) plotmap(sc,um=TRUE) ``` The pruned k nearest neighbour graph allows the derivation of transition probabilities between clusters, based on the remaining links connecting two clusters (at a minimum probability given by `pvalue`) and the inferred probabilities of these links: ```{r} probs <- transitionProbs(res,cl,pvalue=0.01) ``` If RaceID is used for visualization and has been updated with the clustering using the `updateSC` function (see above), the transition probabilities can be visualized in the dimensional reduction representation (`fr=TRUE` for Fruchterman-Rheingold or `um=TRUE` for umap representation, default is t-SNE): ```{r} plotTrProbs(sc,probs,um=TRUE) ``` The argument `prthr` allows discarding all links with a transition probability `<= prthr` and the th `cthr` argument sets a cluster size threshold. Only clusters with `> cthr` cells are shown. The Individual neighbourhood of a cell, e.g, cell `i=20`, can be inspected with the `inspectKnn` function, which returns a list object with information on gene expression and outlier p-values across all original k nearest neighbours. If the function is called with `plotSymbol=TRUE`, a dimensional reduction representation is returned (`um=TRUE` for umap), highlighting nearest neighbours and outliers which are subject to pruning: ```{r warning = FALSE} nn <- inspectKNN(20,expData,res,cl,object=sc,pvalue=0.01,plotSymbol=TRUE,um=TRUE,cex=1) ``` One element of the returned list object is an outlier p-value across all k nearest neighbours: ```{r warning = FALSE} head(nn$pv.neighbours) ``` Another element contains the corresponding normalized gene expression counts: ```{r warning = FALSE} head(nn$expr.neighbours) ``` If the same function is executed with `plotSymbol=FALSE`, the local mean-variance dependence is plotted along with a loess-regression, a second order polynomial fit (used for pruning), and the background model of the local variability (used for biological noise inference, see below): ```{r warning = FALSE} nn <- inspectKNN(20,expData,res,cl,object=sc,pvalue=0.01,plotSymbol=FALSE) ``` Alternatively, the coefficient of variation (CV) can be plotted with the same model fits: ```{r eval = FALSE} nn <- inspectKNN(20,expData,res,cl,object=sc,pvalue=0.01,plotSymbol=FALSE,cv=TRUE) ``` The plot also highlights the local variability associated with cell-to-cell variability of total transcript counts, as calculated directly from the mean and variance of total transcript counts (turquoise) or from a local fit of a gamma distribution (orange). The inference of homogenous local neighbourhoods, given by a cell and its remaining nearest neighbours after pruning, permits the computation of local quantities across such sets of similar cells. To investigate the dynamics of gene expression noise across cell states, we derive local estimates of gene expression variability. The systematic dependence of gene expression variance on the mean expression is a central challenge for the derivation of differential variability between cell states. VarID2 solves this problem by deconvoluting binomial sampling noise, cell-to-cell variability in total transcript counts (as a consequence of cell-to-cell variability in sequencing efficiency, but also due to global biological variability in total RNA content, e.g., during different phases of the cell cycle), and residual variability, which we address as biological noise. This deconvolution is achieved by fitting a Gamma distribution to the local disrtibution of total transcript counts, yielding a technical noise paramater `rt` for each central cell, followed by a maximum a posterior inference of the reidual variability, i.e., biological noise, `epsilon` for each gene in every neighbourhood. The local mean expression `mu` within each neighbourhood is computed as the average of non-normalized transcript count across pruned local neighbourhoods. The noise inference can be done on a transcript count matrix, filtered by expression, e.g., retaining only cells with a minimum of 5 UMI counts in at least 20 cells: ```{r} x <- getFilteredCounts(sc,minexpr=5,minnumber=20) noise <- compTBNoise(res,x,pvalue=0.01,gamma = 0.5,no_cores=1) ``` The most important argument of the `compTBNoise` is the `gamma` parameter which determines the strength of the prior for the suppression of spurious inflation of noise levels at low expression. The `p-value` argument again determines the link probability threshold for pruning. For neighbourhoods with zero transcript count noise estimates are replaced by `NA` in `noise$epsilon`. The prior parameter `gamma` should be chosen such that the correlation between mean noise across cells and total UMI count per cell is minimal. This dependence can be analysed with the function `plotUMINoise`: ```{r} plotUMINoise(sc,noise,log.scale=TRUE) ``` If a positive correlation is observed, `gamma` should be increased in order to weaken the prior. If the correlation is negative, `gamma` should be decreased in order to increase the strength of the prior. The RaceID `SCseq` object can be updated with the biological noise estimates: ```{r} sc <- updateSC(sc,res=res,cl=cl,noise=noise) ``` This initializes a slot `noise` of the `SCseq` object with a gene-by-cell matrix of the biological noise estimates `noise$epsilon`. The noise levels of a given gene (or group of genes), can then be visualized in the same way as gene expression with the help of RaceID functions. Gene expression (on logarithmic scale): ```{r} plotexpmap(sc,"Clca4",logsc=TRUE,um=TRUE,cex=1) ``` Biological noise (on logarithmic scale): ```{r} plotexpmap(sc,"Clca4",logsc=TRUE,um=TRUE,noise=TRUE,cex=1) ``` This plot indicates that cluster 2 comprises Clca4+ intestinal stem cells. A scatter plot of noise versus gene expression allows inspection of the dependence between noise and gene expression: ```{r} plotExpNoise("Clca4",sc,noise,norm=TRUE,log="xy") ``` The biological noise can also be plotted in a cluster heatmap and directly compared to gene expression. After deriving a gene expression heatmap, the order of genes stored in `ph` (returned by the `pheatmap` function of the `pheatmap` package), can be applied to the noise heatmap in order to permit direct comparability: ```{r} genes <- c("Lyz1","Agr2","Clca3","Apoa1","Aldob","Clca4","Mki67","Pcna") ph <- plotmarkergenes(sc,genes=genes,noise=FALSE) plotmarkergenes(sc,genes=genes[ph$tree_row$order],noise=TRUE,cluster_rows=FALSE) ``` Based on marker gene expression, clusters 1, 5, and 6, can be annotated as enterocytes, goblet and Paneth cells. The gene expression can also be visualized in a dot plot highlighting the fraction of cells in a cluster expressing a gene (size of dots) together with the average expresson level (dot color): ```{r} fractDotPlot(sc, genes, zsc=TRUE) ``` With the clustering derived by `graphCluster` and the `noise` object returned by the `compTBNoise` function, genes with differential noise levels in a cluster (or a set of clusters) `set` versus a background set `bgr` (by default all clusters not in `set`) can be inferred: ```{r} ngenes <- diffNoisyGenesTB(noise,cl,set=1,no_cores=1) head(ngenes) ``` The output lists average (non-normalized) expression `mu` and noise levels `eps` for clusters in `set` and `bgr` as well as the log2 fold change in noise levels, `log2FC`, between `set` and `bgr` and a multiple testing-corrected p-value (Bonferroni correction, multiplied by the maximal number `knn` of nearest neighbours). This functions applies a Wilcoxon test to compare variability in a cluster (or set of clusters) versus the remaining clusters or a given background set of clusters. To avoid spurious signals with small effect size, a uniform random variable (by default, from `[0:ps]` with `ps=0.1`) is added to the noise estimates. The top varaible genes can be plotted in a heatmap: ```{r} genes <- head(rownames(ngenes),50) ph <- plotmarkergenes(sc,genes=genes,noise=TRUE,cluster_rows=FALSE,zsc=TRUE) ``` One can also cluster the cells and/or genes in the heatmap based on the noise quantification: ```{r eval=FALSE} ph <- plotmarkergenes(sc,genes=genes,noise=TRUE,cluster_rows=TRUE,cluster_cols=TRUE) ``` The gene expression itself can be represented with the same order of cells and genes by utilizing the `ph` output: ```{r eval=FALSE} plotmarkergenes(sc,genes=ph$tree_row$labels[ ph$tree_row$order ],noise=FALSE,cells=ph$tree_col$labels[ ph$tree_col$order ], order.cells=TRUE,cluster_rows=FALSE) ``` It is also possible to order genes by their noise levels across a (set of) cluster or the entire dataset, if the `cl` argument is omitted: ```{r} mgenes <- maxNoisyGenesTB(noise,cl=cl,set=3) head(mgenes) plotmarkergenes(sc,genes=head(names(mgenes),50),noise=TRUE) ``` Differential noise between clusters in `set` and `bgr` can also be visualized as MA plot: ```{r} ngenes <- diffNoisyGenesTB(noise, cl, set=1, bgr=c(2,4)) plotDiffNoise(ngenes) ``` For comparison, differentially expressed genes between the same groups can be inferred and visualized in a similar manner: ```{r eval=FALSE} dgenes <- clustdiffgenes(sc,1,bgr=c(2,4),pvalue=0.01) plotdiffgenesnb(dgenes,xlim=c(-6,3)) ``` The distribution of gene expression (for individual genes or groups of genes) across a set of clusters can be visualized as violin plots: ```{r} violinMarkerPlot(c("Mki67","Pcna"),sc,set=c(2,3,1)) ``` Similarly, the distribution of gene expression noise can be plotted: ```{r} violinMarkerPlot(c("Mki67","Pcna"),sc,noise,set=c(2,3,1)) ``` To further investigate transcriptome variability and related quantities, the function `quantKnn` allows computation of average noise levels across cells, cell-to-cell transcriptome correlation, and total UMI counts. ```{r} qn <- quantKnn(res, noise, sc, pvalue = 0.01, minN = 5, no_cores = 1) ``` These quantities can be inspected in the dimensional reduction representation, or as boxplots across cluster, with the level of a selected reference cluster highlighted as horizontal line. For instance, the Lgr5+ stem cell cluster 2 can be used as a reference: ```{r} StemCluster <- 2 ``` Average biological noise across all genes: ```{r} plotQuantMap(qn,"noise.av",sc,um=TRUE,ceil=.6,cex=1) plotQuantMap(qn,"noise.av",sc,box=TRUE,cluster=StemCluster) ``` Avergae Spearman's transcriptome correlation across all cells within pruned k nearest neighbourhoods: ```{r} plotQuantMap(qn,"local.corr",sc,um=TRUE,logsc=TRUE,cex=1) plotQuantMap(qn,"local.corr",sc,box=TRUE,logsc=TRUE,cluster=StemCluster) ``` Total UMI count averaged across cells within pruned k nearest neighbourhoods: ```{r} plotQuantMap(qn,"umi",sc,um=TRUE,logsc=TRUE,cex=1) plotQuantMap(qn,"umi",sc,box=TRUE,logsc=TRUE,cluster=StemCluster) ``` Dependencies between different quantities can be inspected in a scatter plot, highlighting a cluster of interest: For instance, the dependency between biological noise and UMI counts: ```{r} plotQQ(qn,"umi","noise.av",sc,cluster=StemCluster,log="yx",cex=1) ``` A residual dependency of the noise estimate on the total UMI count may indicate that the prior is either not strong enough to suppress noise inflation at low expression (negative slope) or too strong (positive slope). In this case, the parameter `gamma` of the function `compTBNoise` should be increased or decreased, respectively. Dependency between transcriptome correlation and average biological noise: ```{r} plotQQ(qn,"local.corr","noise.av",sc,cluster=StemCluster,log="xy",cex=1) ``` To study dynamics of gene expression and biological noise during differentiation, VarID2 facilitates pseudo-time analysis. Selection of a linear trajectry connecting a defined set of clusters could be based on the transition probabilities obtained from VarID2. ```{r} plotTrProbs(sc,probs,um=TRUE) ``` For examples, clusters 1, 5, and 3, represent the differentiation trajectory from Lgr5+ stem cell towards Apoa1+ enterocytes. ```{r} plotexpmap(sc,"Apoa1",um=TRUE,cex=1,logsc=TRUE) ``` To order cells pseudo-temporally, VarID2 offers a wrapper function `pseudoTime`, which calls the Slingshot method [@slingshot] on a desired dimensional reduction, which can be a subset from the precomputed dimensional reductions of the RaceID object (`sc@fr`, `sc@tsne`, or `sc@umap`, given by the `map` parameter), or a dimensional reduction (t-SNE or UMAP) computed for the subset of cells on the trajectory. This set is assumed to be ordered along the trajectory, with the initial cluster as first entry. This function is not designed to recover branched trajetories; single lineages could be defined based on the output of the `plotTrPtobs` function as an ordered set of clusters. `pseudoTime` will always fit a single trajectory to the clusters specified in `set` according to the indicated order. The slingsot package needs to be installed from Bioconductor. If the package is not installed, the function falls back to principal curve inference using the `princurve` package. ```{r results='hide', message=FALSE} # ordered set of clusters on the trajectory set <- c(2,3,1) # if slingshot is available, run with useSlingshot=TRUE (default) pt <- pseudoTime(sc,m="umap",set=set,useSlingshot=FALSE) ``` Inferred pseudo-time can be highlighted in the inferred dimensional reduction representation: ```{r} plotPT(pt,sc,clusters=FALSE) ``` The trajectory can also be overlaid with the cluster annotation: ```{r} plotPT(pt,sc,clusters=TRUE,lineages=TRUE) ``` With the derived pseudo-time a filtered count matrix with cells ordered by pseudo-time can be obtained: ```{r} fs <- extractCounts(sc,minexpr=5,minnumber=20,pt=pt) ``` Using methods from the `FateID` package, pseudo-time expression profiles can be derived by self-organizing maps (SOM) and grouped into modules: ```{r results='hide', message=FALSE, warnings=FALSE} library(FateID) s1d <- getsom(fs,nb=50,alpha=1) ps <- procsom(s1d,corthr=.85,minsom=0) part <- pt$part ord <- pt$ord plotheatmap(ps$all.z, xpart=part[ord], xcol=sc@fcol, ypart=ps$nodes, xgrid=FALSE, ygrid=TRUE, xlab=TRUE) ``` Individual pseudo-time expression profiles can be plotted for genes of interest, e.g., ```{r} plotexpression(fs,y=part,g="Apoa1",n=ord,col=sc@fcol,cex=1,alpha=1) ``` It is also possible to plot profiles for a group of genes into the same window: ```{r} genes <- c("Mki67","Pcna","Apoa1") plotexpressionProfile(fs,y=part,g=genes,n=ord,alpha=1,col=rainbow(length(genes)),lwd=2) ``` From the SOM all genes within a given module, e.g., module 1, can be extracted: ```{r} genes <- getNode(ps,1) plotexpressionProfile(fs,y=part,g=head(genes,10),n=ord,alpha=1,lwd=2) ``` Similarly, gene modules with correlated noise profiles can be derived: ```{r} fsn <- extractCounts(sc,minexpr=5,minnumber=20,pt=pt,noise=TRUE) s1dn <- getsom(fsn,nb=25,alpha=1) psn <- procsom(s1dn,corthr=.85,minsom=0) ``` ```{r eval=FALSE} plotheatmap(psn$all.z, xpart=part[ord], xcol=sc@fcol, ypart=ps$nodes, xgrid=FALSE, ygrid=TRUE, xlab=TRUE) ``` Noise profiles of individual genes or groups of genes: ```{r} plotexpression(fsn,y=part,g="Apoa1",n=ord, col=sc@fcol,cex=1,alpha=1,ylab="Noise") genes <- c("Mki67","Pcna","Apoa1") plotexpressionProfile(fsn,y=part,g=genes,n=ord,alpha=1,col=rainbow(length(genes)),lwd=2,ylab="Noise") genes <- getNode(psn,1) plotexpressionProfile(fsn,y=part,g=head(genes,10),n=ord,alpha=1,lwd=2,ylab="Noise") ``` For further details on plotting functions for pseudo-time analysis see `vignette("FateID")`. Gene expression and noise modules can now be co-analyzed to characterize the inter-dependence of gene expression and noise dynamics during differentiation. ## Removing batch effects and other confounding factors with VarID2 VarID2 offers the possibility to remove batch effects and other confounding sources of variability. To achieve this, batch variables and other latent variables can be included as dependent variables in the negative binomial regression, utilized in the `pruneKnn` function if the parameter `regNB` is set to `TRUE`. Alternatively, batch effect correction can ben done by the `harmony` package, if the `bmethod` parameter of `pruneKnn` is set to `"harmony"`. These batch correction options only exist if `large` is set to `TRUE`. The `batch` parameter is given by a categorical vector of batch names. ```{r eval=FALSE} sc <- SCseq(intestinalData) sc <- filterdata(sc,mintotal=1000,FGenes=grep("^Gm\\d",rownames(intestinalData),value=TRUE),CGenes=grep("^(mt|Rp(l|s))",rownames(intestinalData),value=TRUE)) expData <- getExpData(sc) batch <- sub("5d.+","",colnames(expData)) names(batch) <- colnames(expData) head(batch) require(Matrix) S_score <- colMeans(sc@ndata[intersect(cc_genes$s,rownames(sc@ndata)),]) G2M_score <- colMeans(sc@ndata[intersect(cc_genes$g2m,rownames(sc@ndata)),]) regVar <- data.frame(S_score=S_score, G2M_score=G2M_score) rownames(regVar) <- colnames(expData) res <- pruneKnn(expData,no_cores=1,batch=batch,regVar=regVar) cl <- graphCluster(res,pvalue=0.01) probs <- transitionProbs(res,cl) x <- getFilteredCounts(sc,minexpr=5,minnumber=5) noise <- compTBNoise(res,x,pvalue=0.01,no_cores=1) sc <- updateSC(sc,res=res,cl=cl,noise=noise) sc <- compumap(sc,min_dist=0.5) sc <- comptsne(sc,perplexity=50) plotmap(sc,cex=1) plotmap(sc,fr=TRUE,cex=1) plotmap(sc,um=TRUE,cex=1) plotsymbolsmap(sc,batch,um=TRUE,cex=1) plotexpmap(sc,"Mki67",um=TRUE,cex=1,log=TRUE) plotexpmap(sc,"Pcna",um=TRUE,cex=1,log=TRUE) ``` To regress out additional latent variables associated with distinct phases of the cell cycle, the parameter `regVar` was initialized with a data.frame of cell cycle S phase and G2/M phase scores (aggregated normalized gene expression of cell cycle phase markers). Column names of `regVar` indicate the latent variable names, and row names correspond to cell IDs, i.e. column names of `expData`. Interaction terms between the batch variable and the UMI count or the other latent variables in `regVar` are included. See previous paragraphs for more advice on how to explore the output data. ## Integrating Seurat with VarID2/RaceID3 `Seurat` object can be converted into an \pkg{RaceID}`SCseq` object for VarID2 analysis, using the function `Seurat2SCseq`. This function expects a class `Seurat` object from the \pkg{Seurat} package as input and converts this into a \pkg{RaceID} `SCseq` object. The function transfers the counts, initializes `ndata` and `fdata` without further filtering, transfers the PCA cell embeddings from the `Seurat` object to `dimRed`, transfers the clustering partition, and `umap` and `tsne` dimensional reduction (if available). CAUTION: Cluster numbers in `RaceID` start at 1 by default. Hence, all `Seurat` cluster numbers are shifted by 1. ```{r eval=FALSE} library(Seurat) library(RaceID) Se <- CreateSeuratObject(counts = intestinalData, project = "intestine", min.cells = 3, min.features = 200) Se <- NormalizeData(Se, normalization.method = "LogNormalize", scale.factor = 10000) Se <- FindVariableFeatures(Se, selection.method = "vst", nfeatures = 2000) all.genes <- rownames(Se) Se <- ScaleData(Se, features = all.genes) Se <- RunPCA(Se, features = VariableFeatures(object = Se)) Se <- RunUMAP(Se, dims = 1:30, verbose = FALSE) Se <- RunTSNE(Se, dims = 1:30, verbose = FALSE) Se <- FindNeighbors(Se, dims = 1:10) Se <- FindClusters(Se, resolution = 0.5) DimPlot(Se, label = TRUE) + NoLegend() res <- pruneKnn(Se) ## without pruning (fast) ## res <- pruneKnn(Se,do.prune=FALSE,no_cores=1) sc <- Seurat2SCseq(Se) plotmap(sc,um=TRUE) noise <- compTBNoise(res,getExpData(sc),no_cores=1) sc <- updateSC(sc,noise=noise) plotexpmap(sc,"Clca4",um=TRUE,cex=1,log=TRUE) plotexpmap(sc,"Clca4",um=TRUE,noise=TRUE,cex=1) ``` See previous paragraphs for more advice on how to explore the output data. # FAQ For bug reports and any questions related to RaceID and StemID please email directly to [link](mailto:dominic.gruen@gmail.com). 1. How can I perform RaceID/VarID2 analysis on 10x Genomics Chromium Single Cell RNA-Seq data? Processed count data from the Cell Ranger pipeline (or equivalent formats) can be read from the working directory (or any other path) and transformed into a sparse count matrix, which can be loaded into an `SCseq` object. This requires the `Matrix` package, which is installed as a dependency together with `RaceID`. The following chunk of code can be used: ```{r eval=FALSE} require(Matrix) require(RaceID) x <- readMM("matrix.mtx") f <- read.csv("features.tsv",sep="\t",header=FALSE) b <- read.csv("barcodes.tsv",sep="\t",header=FALSE) rownames(x) <- f[,1] colnames(x) <- b[,1] sc <- SCseq(x) ``` 2. How should I analyze large datasets (>25k cells)? Datasets with >25k cells should be analyzed with VarID2 instead of RaceID, which is more suitable for large datasets and does not require a distance matrix. The memory requirement of a distance matrix scales with N^2 as a function of the number of cells N. VarID2 should be run with with parameters `large=TRUE` and `regNB=TRUE` of the `pruneKnn` function in order to achieve optimal performance (default parameters). To increase processing speed VarID2 can be run with multiple cores. However, each core requires additional memory, and with too many cores VarID2 will run out of memory. To reduce computation time, the number of genes used for the negative binomial regression of total UMI count per cell can be limited to, e.g., 2000 by setting the `ngenes` parameter. Example for mouse data, including filtering of mitochondrial genes, ribosomal genes, and Gm* genes (and genes correlating to these groups): ```{r eval=FALSE} require(Matrix) require(RaceID) x <- readMM("matrix.mtx") f <- read.csv("features.tsv",sep="\t",header=FALSE) b <- read.csv("barcodes.tsv",sep="\t",header=FALSE) rownames(x) <- f[,1] colnames(x) <- b[,1] sc <- SCseq(x) sc <- filterdata(sc,mintotal=1000,CGenes=rownames(x)[grep("^(mt|Rp(l|s)|Gm\\d)",rownames(x))]) expData <- getExpData(sc) res <- pruneKnn(expData,no_cores=5) cl <- graphCluster(res,pvalue=0.01) probs <- transitionProbs(res,cl) ## compute noise from corrected variance noise <- compTBNoise(res,expData,pvalue=0.01,no_cores=5) sc <- updateSC(sc,res=res,cl=cl,noise=noise) sc <- comptsne(sc) sc <- compumap(sc) ``` 3. What can I do if too many clusters were detected in my dataset? This could be a consequence of technical noise, e.g., if experimental batches with strong differences in total UMI counts were co-analyzed. In such cases the background models of RaceID or VarID2 could indicate the presence of many spurious outliers. In RaceID, the number of outliers can be reduced by setting the the `probthr` parameter of the `findoutliers` function to values closer to zero. This is the probability threshold for outlier detection. In VarID2 the `pvalue` parameter of the `graphCluster` function can be set to lower values. This is the probability threshold for pruning links of the knn network. In the extreme case that no outliers (or pruning) are desired, `probthr` (or `pvalue`) should be set to zero. Moreover, the `use.weights` parameter of the `graphCluster` function can be set to `FALSE`; in this case, Louvain clustering is performed on a (pruned) knn-network with equal weights for all links. By default, links are weighted by the link probability, and if strong technical differences lead to low link probabilities, this could be the reason for the emergence of too many clusters. # References --- references: - id: RaceID type: article-journal author: - family: Grun given: D. - family: Lyubimova given: A. - family: Kester given: L. - family: Wiebrands given: K. - family: Basak given: O. - family: Sasaki given: N. - family: Clevers given: H. - family: Oudenaarden given: A. dropping-particle: van issued: - year: '2015' month: '9' title: '[Single-cell messenger RNA sequencing reveals rare intestinal cell types]{.nocase}' container-title: Nature page: '251-255' volume: '525' issue: '7568' - id: mnnCorrect type: article-journal author: - family: Haghverdi given: L. - family: Lun given: A. T. L. - family: Morgan given: M. D. - family: Marioni given: J. C. issued: - year: '2018' month: '6' title: '[Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors]{.nocase}' container-title: Nat. Biotechnology page: '421-427' volume: '36' issue: '5' - id: FateID type: article-journal author: - family: Herman given: J. S. - family: Sagar given: - family: Grun given: D. issued: - year: '2018' month: '5' title: '[FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data]{.nocase}' container-title: Nat. Methods page: '379-386' volume: '15' issue: '5' - id: Noise type: article-journal author: - family: Grun given: D. - family: Kester given: L. - family: Oudenaarden given: A. dropping-particle: van issued: - year: '2014' month: '6' title: '[Validation of noise models for single-cell transcriptomics]{.nocase}' container-title: Nat. Methods page: '637-640' volume: '11' issue: '6' - id: Census type: article-journal author: - family: Qiu given: X. - family: Hill given: A. - family: Packer given: J. - family: Lin given: D. - family: Ma given: Y. A. - family: Trapnell given: C. issued: - year: '2017' month: '3' title: '[Single-cell mRNA quantification and differential analysis with Census]{.nocase}' container-title: Nat. Methods page: '309-315' volume: '14' issue: '3' - id: StemID type: article-journal author: - family: Grun given: D. - family: Muraro given: M. J. - family: Boisset given: J. C. - family: Wiebrands given: K. - family: Lyubimova given: A. - family: Dharmadhikari given: G. - family: Born given: M. dropping-particle: van den - family: Es given: J. dropping-particle: van - family: Jansen given: 'E.' - family: Clevers given: H. - family: Koning given: E. J. P. dropping-particle: de - family: Oudenaarden given: A. dropping-particle: van issued: - year: '2016' month: '8' title: '[De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data]{.nocase}' container-title: Cell Stem Cell page: '266-277' volume: '19' issue: '2' - id: DESeq type: article-journal author: - family: Anders given: S. - family: Huber given: W. issued: - year: '2010' title: '[Differential expression analysis for sequence count data]{.nocase}' container-title: Genome Biol. page: R106 volume: '11' issue: '10' - id: DESeq2 type: article-journal author: - family: Love given: M. I. - family: Huber given: W. - family: Anders given: S. issued: - year: '2014' title: '[Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2]{.nocase}' container-title: Genome Biol. page: '550' volume: '15' issue: '12' - id: RandomForests type: article-journal author: - family: Breiman given: L. issued: - year: '2001' title: '[Random Forests]{.nocase}' container-title: Mach. Learn. page: '5-32' volume: '45' issue: '1' - id: TSNE type: article-journal author: - family: Maaten given: L. dropping-particle: van der - family: Hinton given: G. issued: - year: '2008' title: '[Visualizing Data using t-SNE]{.nocase}' container-title: J. Mach. Learn. page: '2570-2605' volume: '9' - id: DPT type: article-journal author: - family: Haghverdi given: L. - family: Buttner given: M. - family: Wolf given: F. A. - family: Buettner given: F. - family: Theis given: F. J. issued: - year: '2016' month: '10' title: '[Diffusion pseudotime robustly reconstructs lineage branching]{.nocase}' container-title: Nat. Methods page: '845-848' volume: '13' issue: '10' - id: Destiny type: article-journal author: - family: Angerer given: P. - family: Haghverdi given: L. - family: Buttner given: M. - family: Theis given: F. J. - family: Marr given: C. - family: Buettner given: F. issued: - year: '2016' month: '4' title: '[destiny: diffusion maps for large-scale single-cell data in R]{.nocase}' container-title: Bioinformatics page: '1241-1243' volume: '32' issue: '8' - id: sctransform type: article-journal author: - family: Hafemeister given: C. - family: Satija given: R. issued: - year: '2019' month: '12' title: '[Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression]{.nocase}' container-title: Genome Biol. page: '296' volume: '20' issue: '1' - id: offSetModel type: article-journal author: - family: Lause given: J. - family: Berens given: P. - family: Kobak given: D. issued: - year: '2020' month: '12' title: '[Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data]{.nocase}' container-title: Genome Biol. page: '258' volume: '22' issue: '1' - id: slingshot type: article-journal author: - family: Street given: K. - family: Risso given: D. - family: Fletcher given: R.B. - family: Das given: D. - family: Ngai given: J. - family: Yosef given: N. - family: Purdom given: E. - family: Dudoit given: S. issued: - year: '2018' month: '6' title: '[Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics]{.nocase}' container-title: BMC Genomics page: '477' volume: '19' - id: VarID type: article-journal author: - family: Grun given: D. issued: - year: '2020' month: '1' title: '[Revealing dynamics of gene expression variability in cell state space]{.nocase}' container-title: Nat. Methods page: '45-49' volume: '17' issue: '1' - id: VarID2 type: article-journal author: - family: Rosalez-Alvarez given: R.E. - family: Rettkowski given: J. - family: Herman given: J.S. - family: Dumbovic given: G. - family: Cabezas-Wallscheid given: N. - family: Grun given: D. issued: - year: '2023' month: '8' title: '[Gene expression noise dynamics unveil functional heterogeneity of ageing hematopoietic stem cells]{.nocase}' container-title: Genome Biol. page: '148' volume: '24' issue: '1' ...