‘minimapR’ is a wrapper for ‘Minimap2’. ‘Minimap2’ is a very valuable long read aligner for the Pacbio and Oxford Nanopore Technologies sequencing platforms. ‘minimapR’ is an R wrapper for ‘minimap2’ which was developed by Heng Li me@liheng.org. This tool aligns long reads to a reference genome and is used in many different bioinformatics workflows.
library(minimapR)
#> Loading required package: Rsamtools
#> Loading required package: GenomeInfoDb
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
#> union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> Loading required package: git2r
#>
#> Attaching package: 'git2r'
#> The following objects are masked from 'package:Biostrings':
#>
#> as.data.frame, head
#> The following object is masked from 'package:XVector':
#>
#> as.data.frame
#> The following objects are masked from 'package:GenomicRanges':
#>
#> as.data.frame, merge
#> The following objects are masked from 'package:GenomeInfoDb':
#>
#> as.data.frame, merge
#> The following objects are masked from 'package:IRanges':
#>
#> as.data.frame, diff, merge
#> The following objects are masked from 'package:S4Vectors':
#>
#> as.data.frame, head, merge
#> The following object is masked from 'package:BiocGenerics':
#>
#> as.data.frame
#> Loading required package: pafr
#> Loading required package: ggplot2
‘minimap2’ and ‘samtools’ must be installed on your system along with the R packages ‘Rsamtools’, ‘git2r’, and ‘pafr’. ‘minimap2’ can be installed on various operating systems by running the following function or following the instruction from the output of the function. “/your/path/to/directory/for/install” should be replaced with the path to the directory where you want to install ‘minimap2’.
minimap2_installation("/your/path/to/directory/for/install")
check if dependencies were successfully installed.
minimap2_check()
#> minimap2 is installed.
#> minimap2
#> "/home/jake/software/minimap2/minimap2"
samtools_check()
#>
#> samtools is installed.
#> samtools
#> "/home/jake/anaconda3/bin/samtools"
We will download the reference genomes and the query sequences for the example. The reference genomes are the human genome (GRCh38) and the yeast genome (S288C). The query sequences are the human ONT reads and yeast HIFI reads.
tempdir()
tmp_folder <-cat("Temporary folder is:", tmp_folder, "\n")
#> Temporary folder is: /tmp/Rtmp8GTYNH
"https://github.com/jake-bioinfo/minimapR/raw/master/inst/extdata/GRCh38_chr1_50m.fa"
href_url <- "https://github.com/jake-bioinfo/minimapR/raw/master/inst/extdata/ont_hs_sample.fastq.gz"
hfq_url <- "https://github.com/jake-bioinfo/minimapR/raw/master/inst/extdata/S288C_ref_genome.fasta"
yref_url <- "https://github.com/jake-bioinfo/minimapR/raw/master/inst/extdata/yeast_sample_hifi.fastq.gz"
yfq_url <- c(href_url, hfq_url, yref_url, yfq_url)
url_list <-lapply(url_list, function(x) download.file(x, destfile = file.path(tmp_folder, basename(x))))
#> [[1]]
#> [1] 0
#>
#> [[2]]
#> [1] 0
#>
#> [[3]]
#> [1] 0
#>
#> [[4]]
#> [1] 0
# Contents of the temporary folder
cat("Contents of the temporary folder are:", "\n")
#> Contents of the temporary folder are:
list.files(tmp_folder, pattern = ".fa", full.names = TRUE)
fa_list <-
fa_list#> [1] "/tmp/Rtmp8GTYNH/GRCh38_chr1_50m.fa"
#> [2] "/tmp/Rtmp8GTYNH/S288C_ref_genome.fasta"
#> [3] "/tmp/Rtmp8GTYNH/ont_hs_sample.fastq.gz"
#> [4] "/tmp/Rtmp8GTYNH/yeast_sample_hifi.fastq.gz"
We will align the human ONT reads to the human genome and the yeast HIFI reads to the yeast genome using ‘minimap2’. The ‘minimap2’ function returns a data frame with the alignment information.
# Human ONT alignment
file.path(tmp_folder, "ont_hs_sample")
h_out <- minimap2(reference = fa_list[1],
h_alignment <-query_sequences = fa_list[2],
output_file_prefix = h_out,
preset_string = "map-ont",
threads = 4,
return = TRUE,
verbose = FALSE)
# Yeast HIFI alignment
file.path(tmp_folder, "yeast_sample_hifi")
y_out <- minimap2(reference = fa_list[3],
y_alignment <-query_sequences = fa_list[4],
output_file_prefix = y_out,
preset_string = "map-hifi",
threads = 4,
return = TRUE,
verbose = TRUE)
#> Generating output file: /tmp/Rtmp8GTYNH/yeast_sample_hifi
#> Running minimap2...
#> Running minimap with the following command:
#> /home/jake/software/minimap2/minimap2 -ax map-hifi -t 4 /tmp/Rtmp8GTYNH/ont_hs_sample.fastq.gz /tmp/Rtmp8GTYNH/yeast_sample_hifi.fastq.gz -o /tmp/Rtmp8GTYNH/yeast_sample_hifi.sam
# Check the alignment
cat("Alignment for the human sample is:\n")
#> Alignment for the human sample is:
1:5, 1:7]
h_alignment[#> qname flag rname strand pos qwidth mapq
#> 1 ref|NC_001135| 256 NC_000001.11_1-50000000 + 3929397 316620 0
#> 2 ref|NC_001224| 2064 NC_000001.11_1-50000000 - 4099809 611 1
#> 3 ref|NC_001135| 272 NC_000001.11_1-50000000 - 5386833 316620 0
#> 4 ref|NC_001224| 2064 NC_000001.11_1-50000000 - 9123181 665 1
#> 5 ref|NC_001143| 256 NC_000001.11_1-50000000 + 9223252 666816 0
cat("Alignment for the yeast sample is:\n")
#> Alignment for the yeast sample is:
1:5, 1:7]
y_alignment[#> qname flag rname strand pos qwidth mapq
#> 1 SRR13577847.50 4 <NA> <NA> NA NA NA
#> 2 SRR13577847.52 4 <NA> <NA> NA NA NA
#> 3 SRR13577847.144 4 <NA> <NA> NA NA NA
#> 4 SRR13577847.203 4 <NA> <NA> NA NA NA
#> 5 SRR13577847.251 4 <NA> <NA> NA NA NA
# Clean up
unlink(tmp_folder, recursive = TRUE)
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] minimapR_0.0.1.0 pafr_0.0.2 ggplot2_3.5.1
#> [4] git2r_0.33.0 Rsamtools_2.20.0 Biostrings_2.72.0
#> [7] XVector_0.44.0 GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
#> [10] IRanges_2.38.0 S4Vectors_0.42.0 BiocGenerics_0.50.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3
#> [4] bitops_1.0-7 stringi_1.8.4 digest_0.6.36
#> [7] magrittr_2.0.3 evaluate_0.24.0 grid_4.4.1
#> [10] fastmap_1.2.0 jsonlite_1.8.8 httr_1.4.7
#> [13] fansi_1.0.6 UCSC.utils_1.0.0 scales_1.3.0
#> [16] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.3
#> [19] rlang_1.1.4 crayon_1.5.3 munsell_0.5.1
#> [22] withr_3.0.0 cachem_1.1.0 yaml_2.3.9
#> [25] tools_4.4.1 parallel_4.4.1 BiocParallel_1.38.0
#> [28] dplyr_1.1.4 colorspace_2.1-0 GenomeInfoDbData_1.2.12
#> [31] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
#> [34] zlibbioc_1.50.0 stringr_1.5.1 pkgconfig_2.0.3
#> [37] bslib_0.7.0 pillar_1.9.0 gtable_0.3.5
#> [40] glue_1.7.0 xfun_0.45 tibble_3.2.1
#> [43] tidyselect_1.2.1 knitr_1.48 htmltools_0.5.8.1
#> [46] rmarkdown_2.27 compiler_4.4.1