---
title: "CRISPR Screen and Gene Expression Differential Analysis"
author: "Lianbo Yu, Yue Zhao, Kevin R. Coombes, and Lang Li"
date: "`r Sys.Date()`"
output:
  pdf_document:
    latex_engine: xelatex
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{CRISPR Screen and Gene Expression Differential Analysis}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{CEDA} 
  %\VignettePackage{CEDA} 
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```


# Introduction
We developed CEDA to analyze read counts of single guide RNAs (sgRNAs) from CRISPR screening experiments. Except for non-targeting sgRNAs (can be used as negative controls), each sgRNA is targeting one gene, and each gene have multiple sgRNAs targeting it. CEDA models the sgRNA counts at different levels of gene expression by multi-component normal mixtures, with the model fit by an EM algorithm. Posterior estimates at sgRNA level are then summarized for each gene.

In this document, we use data from an experiment with the MDA231 cell line to illustrate how to use CEDA to perform CRISPR screen data analysis.

# Overview
CEDA analysis follows a workflow that is typical for most omics level experiments.

1. Put the data into an appropriate format for input to CEDA.
2. Filter out sgRNA with low read counts (optional).
3. Normalize the raw counts.
4. Fit a linear model to the data.
5. Summarize and view the results. 

## Data Format
In our experiment, three samples of MDA231 cells were untreated at time T=0,
and another three samples of MDA231 cells were treated with DMSO at time T=0.
We are interested in detecting sgRNAs that are differentially changed by a treatment. 

The sgRNA read counts, along with a list of non-essential genes, are stored 
in the dataset `mda231` that we have included in the `CEDA` package. We read that
dataset and explore its structure.
```{r data}
library(CEDA)
library(dplyr)
library(ggplot2)
library(ggsci)
library(ggprism)
library(ggridges)
set.seed(1)
data("mda231")
class(mda231)
length(mda231)
names(mda231)
```
As you can see, this is a list containing two components

1. `sgRNA`, the observed count data of six samples, and
2. `neGene`, the set of non-essential genes.

```{r sgRNA}
dim(mda231$sgRNA)
length(mda231$neGene$Gene)
head(mda231$sgRNA)
```
Notice that the `sgRNA` component includes an extra column, "`exp.level.log2`", that
are the expression level (in log2 scale) of genes and was computed from gene 
expression data in the unit of FPKM.

The second component of the list `mda231` is a data frame `neGene`, which is the gene names of the non-essential genes(Ref 1):
```{r neGene}
dim(mda231$neGene)
head(mda231$neGene)
```

## Filter out sgRNAs with low read counts
Due to the nature of drop-out screen, some sgRNAs targeting essential genes have low read counts at the end time point samples. Therefore, we must be cautious when applying strict filtering criteria to remove a large portion of sgRNAs before analysis. Gentle filtering criteria (i.e., removal of sgRNAs with 0 count in ≥ 75% samples) is recommended.
```{r filter}
count_df <- mda231$sgRNA
mx.count = apply(count_df[,c(3:8)],1,function(x) sum(x>=1))
table(mx.count)
# keep sgRNA with none zero count in at least one sample
count_df2 = count_df[mx.count>=1,]
```

## Normalization
The sgRNA read counts needs to be normalized across sample replicates before 
formal analysis. The non-essential genes are assumed to have no change after 
DMSO treatment. So, our recommended procedure is to perform median normalization
based on the set of non-essential genes.
```{r normalization}
mda231.ne <- count_df2[count_df2$Gene %in% mda231$neGene$Gene,]
cols <- c(3:8)
mda231.norm <- medianNormalization(count_df2[,cols], mda231.ne[,cols])[[2]]
```

## Analysis
Our primary goal is to detect essential sgRNAs that have different count levels 
between conditions. We rely on the R package `limma` to calculate log2 ratios (i.e., log fold changes or LFCs) between three untreated and three treated samples. 

### Calculating fold ratios
First, we have to go through the usual `limma` steps to describe the design of
the study. There were two groups of replicate samples. We will call these groups
"Control" and "Baseline" (although "Treated" and Untreated" would work just as well).
Our main interest is determining the differences between the groups. And we have to
record this information in a "contrast matrix" so limma knows what we want to compare.
```{r design}
group <- gl(2, 3, labels=c("Control","Baseline"))
design <- model.matrix(~  0 + group)
colnames(design) <- sapply(colnames(design), function(x) substr(x, 6, nchar(x)))
contrast.matrix <- makeContrasts("Control-Baseline", levels=design)
```
Finally, we can run the limma algorithm.
```{r limfit}
limma.fit <- runLimma(log2(mda231.norm+1),design,contrast.matrix)
```

We merge the results from our limma analysis with the post filtering sgRNA count data.
```{r merge}
mda231.limma <- data.frame(count_df2, limma.fit)
head(mda231.limma)
```

### Fold ratios under the null hypotheses
Under the null hypothses, all sgRNAs levels are unchanged between the
two conditions. To obtain fold ratios under the null, samples were 
permuted between two conditions, and log ratios were obtained from
limma analysis under each permutation.
```{r betanull}
betanull <- permuteLimma(log2(mda231.norm + 1), design, contrast.matrix, 10)
theta0 <- sd(betanull)
theta0
```

### Fitting three-component mixture models
A three-component mixture model (unchanged, overexpresssed, and underexpressed)
is assumed for log ratios at different level of gene expression. Empirical Bayes method was employed to estimate parematers of the mixtures and posterior means were obtained for estimating actual log ratios between the two conditions. P-values of sgRNAs were then calculated by permutation method.

```{r mm, results='hide'}
nmm.fit <- normalMM(mda231.limma, theta0, n.b=3, d=5)
```

Results from the mixture model were shown in Figure $1$. False 
discovery rate of $0.05$ was used for declaring significant changes in red
color between the two conditions for sgRNAs. The vertical lines are dividing sgRNAs into bins accroding to the gene expression levels of their targetted genes.

```{r fig1, fig.cap = "Log fold ratios of sgRNAs vs. gene expression level"}
scatterPlot(nmm.fit$data,fdr=0.05,xlim=c(-0.5,12),ylim=c(-8,5))
```

### Gene level summarization
From the p-values of sgRNAs, gene level p-values were obtained by using modified robust rank aggregation method (alpha-RRA). Log ratios were also summarized at gene level.
```{r pval}
mda231.nmm <- nmm.fit[[1]]
p.gene <- calculateGenePval(exp(mda231.nmm$log_p), mda231.nmm$Gene, 0.05, nperm=10)
gene_fdr <- stats::p.adjust(p.gene$pvalue, method = "fdr")
gene_lfc <- calculateGeneLFC(mda231.nmm$lfc, mda231.nmm$Gene)

gene_summary <- data.frame(gene_pval=unlist(p.gene$pvalue), gene_fdr=as.matrix(gene_fdr), gene_lfc = as.matrix(gene_lfc))
gene_summary$gene <- rownames(gene_summary)
gene_summary <- gene_summary[,c(4,1:3)]
```

### Gene level summarization
From the p-values of sgRNAs, gene level p-values were obtained by using modified robust rank aggregation method (alpha-RRA). Log ratios were also summarized at gene level.
```{r summary}
#extract gene expression data
gene.express <- mda231.nmm %>% group_by(Gene) %>% summarise_at(vars(exp.level.log2), max)
#merge gene summary with gene expression
gdata <- left_join(gene_summary, gene.express, by = c("gene" = "Gene"))
gdata <- gdata %>% filter(is.na(exp.level.log2)==FALSE)
# density plot and ridge plot
gdata$gene.fdr <- gdata$gene_fdr
data <- preparePlotData(gdata, gdata$gene.fdr)
```

Results from CEDA were shown in Figure $2$. The points in the scatter plot were stratified into five color groups based on FDR. The 2D contour lines showed how the points distributed in 3D space. 
```{r fig2, fig.cap = "2D density plot of gene log fold ratios vs. gene expression level for different FDR groups"}
densityPlot(data)
```

The density plot of CEDA results shows that the groups of genes selected by CEDA (FDR<0.05, yellow, blue, and red) showed higher expression median values compared to the rest of genes (purple, green).