---
title: "SeqKat"
author: "Fouad Yousif, Xihui Lin, Fan Fan, Christopher Lalansingh"
date: "`r Sys.Date()`"
output: pdf_document
vignette: >
  %\VignetteIndexEntry{SeqKat}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Background
Kataegis is a localized hypermutation occurring when a region is enriched in somatic SNVs (*Nik-Zainal S. et al 2012*). Kataegis can result from multiple cytosine deaminations catalyzed by the AID/APOBEC family of proteins (*Lada AG et al 2012*). A first step to understand kataegis requires the ability to reproducibly and reliability identify it. Although a formal, quantifiable definition of kataegis has not been reached, we have provided the first operational definition in the form of SeqKat, a R package that predicts kataegis from paired tumour normal human whole genome samples. This package contains functions to detect kataegis from SNVs in BED format. 


## Approach
SeqKat uses a sliding window (of fixed width) approach to test deviation of observed SNV trinucleotide content and inter-mutational distance from expected by chance alone. Additionally, an exact binomial test is performed to test that the proportion of each of the 32 tri-nucleotides within each window is higher than expected. The resulting p-values are then adjusted for multiple hypothesis testing using FDR. Hypermutation and kataegic scores are calculated for each window as follows


  
> *hypermutation score* = $-log_{10}$(binomial $p_{adj}$) * $\frac{N observed Mutations}{N expected Mutations}$  [Equation 1] 
  
> *kataegis score* = *hypermutation score* * $\frac{N TCX bases}{N expected TCX bases}$ [Equation 2]
  


SeqKat reports both hypermutation score and an APOBEC mediated kataegic score along with the start and end position of each detected event. A reference paper will be added upon publication in an upcoming version of this package. 

## Input
SeqKat accepts a SNV BED file per patient with the following columns:

- chromosome
- position
- reference base
- alternate base

## Running SeqKat

```{}
seqkat(sigcutoff = 5,
       mutdistance = 3.2,
       segnum = 4,
       ref.dir = NULL,
       bed.file = "./",
       output.dir = "./",
       chromosome = "all",
       chromosome.length.file = NULL,
       trinucleotide.count.file = NULL
       )
```

- **sigcutoff**: The minimum hypermutation score used to classify the windows in the sliding binomial test as significant windows. The score is calculated following [Equation 1]. *Recommended value*: 5

- **mutdistance**: The maximum intermutational distance allowed for SNVs to be grouped in the same kataegic event. *Recommended value*: 3.2

- **segnum**: Minimum mutation count. The minimum number of mutations required within a cluster to be identified as kataegic. *Recommended value*: 4

- **ref.dir**: Path to a directory containing the reference genome. Each chromosome should have its own .fa file and chromosomes X and Y are named as chr23 and chr24. The fasta files should contain no header.

- **bed.file**: Path to the SNV BED file. The BED file should contain the following information: Chromosome, Position, Reference allele, Alternate allele. The file should be named {SAMPLE_NAME}_snvs.bed. Below is an example of a BED file:


```{}
chr4	17185	G	A
chr4	38640	T	C
chr4	52598	C	T
chr4	53102	C	G
chr4	71989	G	A
chr4	91099	C	G
chr4	91139	G	C
chr4	192852	G	C
chr4	201573	G	C
chr4	212498	C	G
```

- **output.dir**: Path to a directory where output will be created

- **chromosome**: The chromosome to be analysed. This can be (1, 2, ..., 23, 24) or "all" to run sequentially on all chromosomes

- **chromosome.length.file** *(provided)*: A tab separated file containing the lengths of all chromosomes in the reference genome. Below is an example of a chromosome.length.file for hg19:

```{}
"num" "length"
"1" 249250621
"2" 243199373
"3" 198022430
"4" 191154276
"5" 180915260
"6" 171115067
"7" 159138663
"8" 146364022
"9" 141213431
"10" 135534747
"11" 135006516
"12" 133851895
"13" 115169878
"14" 107349540
"15" 102531392
"16" 90354753
"17" 81195210
"18" 78077248
"19" 59128983
"20" 63025520
"21" 48129895
"22" 51304566
"23" 155270560
"24" 59373566
"sum.f" 3036303846
"sum.m" 3095677412
```

- **trinucleotide.count.file** *(provided)*: A tab seprarated file containing a count of all trinucleotides present in the reference genome. This can be generated with the get.trinucleotide.counts() function in this package. Below is an example of trinuclotide.count.file for hg19:

```{}
trinucleotide	count
ACA	118307548
ACC	67377361
ACG	15031779
ACT	94371148
ATA	120046083
ATC	78231773
ATG	107040211
ATT	145343907
CCA	107257513
CCC	75979793
CCG	16160851
CCT	103230644
CTA	75285140
CTC	98529730
CTG	118268845
CTT	117139678
GCA	84247974
GCC	68786498
GCG	13995012
GCT	81425256
GTA	65911575
GTC	54961008
GTG	88082699
GTT	86132552
TCA	114943318
TCC	90485129
TCG	13117247
TCT	130196437
TTA	120231763
TTC	117019750
TTG	111364601
TTT	225211336
```


**note**: sigcutoff, multidistance and segnum default parameters are optimized using *Alexandrov et al*'s "Signatures of mutational processes in human cancer" dataset.

**note**: trinucleotide.count.file and chromosome.length.file have been provided for GRCh38 reference as well



## Output

If Kataegic events are detected, SeqKat generates a tab delimited file that includes details about the detected events. Each line represents one detected hypermutation or kataegic event. The file includes the following columns:

- **sample**: sample name
- **chr**: chromosome 
- **start**: coordinate indicating where the event starts
- **end**: coordinate indicating where the event ends
- **variants**: number of SNVs within the event
- **score.hm**: hypermutation score. 
- **score.kat**: kataegic score

**note**: if no event is detected then no file is generated

## Example

A subset BED file from the publically available breast cancer sample PD4120a is provided in the package. This BED contains 2804 SNVs in the first 2,000,000 bases of chromosome 4. A subset FASTA file and chromosome length file have also been provided for **testing purposes only**. 


```{}
example.bed.file <- paste0(
	path.package("SeqKat"),
	"/extdata/test/PD4120a-chr4-1-2000000_test_snvs.bed"
	);
example.ref.dir <- paste0(
	path.package("SeqKat"),
	"/extdata/test/ref/"
	);
example.chromosome.length.file <- paste0(
	path.package("SeqKat"),
	"/extdata/test/length_hg19_chr_test.txt"
	);
seqkat(
	5,
	3.2,
	2,
	bed.file = example.bed.file,
	output.dir = ".",
	chromosome = "4",
	ref.dir = example.ref.dir,
	chromosome.length.file = example.chromosome.length.file
	);
```


To view the detected events, you can check the file *PD4120a-chr4-1-2000000_chr4_cutoff5_mutdist3.2_segnum2.txt*

```{}
sample	chr	start	end	variants	score.hm	score.kat
PD4120a-chr4-1-2000000	4	1009070	1009541	4	119920.029973896	0
```

In this example, SeqKat detected one hypermutation window on chromosome 4 between 1009070 and 1009541, containing 4 SNVs with a hypermutation score of 119920.03. The kataegic score is 0, indicating that it is not an APOBEC mediated event. 


## References

- Nik-Zainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, Jones D, Hinton J, Marshall J, Stebbings LA, Menzies A, Martin S, Leung K, Chen L, Leroy C, Ramakrishna M, Rance R, Lau KW, Mudie LJ, Varela I, McBride DJ, Bignell GR, Cooke SL, Shlien A, Gamble J, Whitmore I, Maddison M, Tarpey PS, Davies HR, Papaemmanuil E, Stephens PJ, McLaren S, Butler AP, Teague JW, Jönsson G, Garber JE, Silver D, Miron P, Fatima A, Boyault S, Langerød A, Tutt A, Martens JW, Aparicio SA, Borg Å, Salomon AV, Thomas G, Børresen-Dale AL, Richardson AL, Neuberger MS, Futreal PA, Campbell PJ, Stratton MR; Breast Cancer Working Group of the International Cancer Genome Consortium.. **Mutational processes molding the genomes of 21 breast cancers**. *Cell*. 2012 May 25;149(5):979-93. doi: 10.1016/j.cell.2012.04.024. Epub 2012 May 17. PubMed PMID: 22608084; PubMed Central PMCID: PMC3414841.

- Lada AG, Dhar A, Boissy RJ, Hirano M, Rubel AA, Rogozin IB, Pavlov YI. **AID/APOBEC cytosine deaminase induces genome-wide kataegis**. *Biol Direct*. 2012 Dec 18;7:47; discussion 47. doi: 10.1186/1745-6150-7-47. PubMed PMID: 23249472; PubMed Central PMCID: PMC3542020
- Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Børresen-Dale AL, Boyault S, Burkhardt B, Butler AP, Caldas C, Davies HR, Desmedt C, Eils R, Eyfjörd JE, Foekens JA, Greaves M, Hosoda F, Hutter B, Ilicic T, Imbeaud S, Imielinski M, Jäger N, Jones DT, Jones D, Knappskog S, Kool M, Lakhani SR, López-Otín C, Martin S, Munshi NC, Nakamura H, Northcott PA, Pajic M, Papaemmanuil E, Paradiso A, Pearson JV, Puente XS, Raine K, Ramakrishna M, Richardson AL, Richter J, Rosenstiel P, Schlesner M, Schumacher TN, Span PN, Teague JW, Totoki Y, Tutt AN, Valdés-Mas R, van Buuren MM, van 't Veer L, Vincent-Salomon A, Waddell N, Yates LR; Australian Pancreatic Cancer Genome Initiative.; ICGC Breast Cancer Consortium.; ICGC MMML-Seq Consortium.; ICGC PedBrain., Zucman-Rossi J, Futreal PA, McDermott U, Lichter P, Meyerson M, Grimmond SM, Siebert R, Campo E, Shibata T, Pfister SM, Campbell PJ, Stratton MR. **Signatures of mutational processes in human cancer**. *Nature*. 2013 Aug 22;500(7463):415-21. doi: 10.1038/nature12477. Epub 2013 Aug 14. Erratum in: Nature. 2013 Oct 10;502(7470):258. Imielinsk, Marcin [corrected to Imielinski, Marcin]. PubMed PMID: 23945592; PubMed Central PMCID: PMC3776390.