DBTC is an R implementation of Dada2 and a BLAST approach to metabarcoding analysis.
The Dada-BLAST-Taxon Assign-Condense (‘DBTC’) package provides molecular analysis of high-throughput metabarcode sequence data (fastq files), taxonomic assignment of unique reads (fasta format files) using the Basic Local Alignment Search Tool (BLAST), and R functions to reduce the obtained results (see additional documentation at DBTC). The package contains the foundational ‘DBTC’ functions run through the R command line (but also see ‘DBTCShiny’ and the DBTCShinyTutorial). The ‘DBTC’ functions have four main outcomes…
NOTE: While the DBTC package has been built for the analysis of high-throughput sequencing results, the BLAST, taxonomic assignment, and taxonomic condense can be utilized with single specimen Sanger sequencing data.
Also see DBTCShiny and the associated DBTCShinyTutorial.
DBTC can be installed three ways.
install.packages('DBTC')
Run the following commands in your R terminal…
if(!require(devtools)) install.packages('devtools')
library('devtools')
devtools::install_github('rgyoung6/DBTC')
library('DBTC')
Note: the first command to install 'devtools' may not be necessary if it is already installed.
Navigate to the DBTC GitHub page. Download the files associated with this page to your local computer and place them in a main folder named DBTC. Then run the following command, replacing HERE with the path to that location on your local computer…
library("DBTC", lib.loc="HERE")
Follow the instructions on the NCBI BLAST+ executables page to obtain a local version of the BLAST tools. The list of the latest installation files can be found here.
Note: It is best to download and install the most recent versions of the BLAST+ suite to your computer and place the programs in your computer's PATH so you can access them from any folder. However, the program files for both blastn and makeblastdb have been included in the DBTCShinyTutorial GitHub page for ease of use (please note that these may not be the most recent versions).
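If you would like to confirm from within R that the BLAST+ programs are reachable, a quick check such as the one below can be used (this snippet is illustrative and not part of DBTC itself):

```
# Returns the full path to each program, or an empty string if it is not on the PATH
Sys.which(c("blastn", "makeblastdb"))

# Prints the installed blastn version; an error here suggests the PATH is not set up
system2("blastn", args = "-version", stdout = TRUE)
```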
The R package taxonomizr is used to establish an NCBI taxaID database (see the taxonomizr website; NOTE: this package is also required when using the taxon assignment elements in the DBTC pipeline).
install.packages('taxonomizr')
library('taxonomizr')
NCBI BLASTable databases can be established through two methods.
Download your desired preformatted NCBI database by using the ‘update_blastdb.pl’ (found in the NCBI BLAST+ local install folder). NOTE: Perl programming language needs to be installed on your local machine. Instructions can be found at Get NCBI BLAST databases.
You can download your desired preformatted NCBI database manually with instructions at BLAST FTP Site and a list of the available databases at Index of /blast/db.
In addition to the NCBI resources, DBTC can also use custom databases. To establish these databases you will require a fasta file containing the desired records with MACER formatted headers. The MACER R package and instructions can be found at either of the two locations:
- MACER CRAN
- MACER GitHub (will have the most recent version and development versions)
In the ‘Preparation’ section of the taxonomizr website, use the instructions and the prepareDatabase(‘accessionTaxa.sql’, getAccessions = FALSE) taxonomizr command to establish a local taxonomy database.
prepareDatabase('accessionTaxa.sql', getAccessions = FALSE)
Note: Along with the accessionTaxa.sql file, two other files, nodes.dmp and names.dmp, are downloaded. These two files are not necessary for downstream analyses and can be deleted.
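Once the taxonomy database has been built, a quick sanity check can confirm it is readable. The example below uses the taxonomizr getTaxonomy() function with an arbitrary NCBI taxID (9606, Homo sapiens) purely for illustration:

```
library('taxonomizr')

# Look up the full lineage for an example taxID against the local database
getTaxonomy(9606, 'accessionTaxa.sql')
```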
The ShortRead package is required to run elements of the DBTC pipeline and can be obtained through Bioconductor.
The dada2 package is the main package used to process the raw fastq files and can be obtained from Bioconductor. There is also a dada2 GitHub resource.
if (!require('BiocManager', quietly = TRUE))
install.packages('BiocManager')
BiocManager::install('ShortRead')
BiocManager::install('dada2')
library(dada2)
Each of the below CRAN packages and their dependencies is required for the DBTC package.
install.packages(c('ggplot2',
'parallel',
'pbapply',
'plyr',
'stats',
'taxonomizr',
'utils'))
# library() loads a single package at a time, so loop over the vector of package names
lapply(c('ggplot2',
         'parallel',
         'pbapply',
         'plyr',
         'stats',
         'taxonomizr',
         'utils'),
       library, character.only = TRUE)
After installation of DBTC and all of its dependencies, load the package using the following command…
library(DBTC)
## dada_implement()
The dada_implement() function takes fastq files as input, analyses them, and produces amplicon sequence variant (ASV) files. This function requires a main directory containing folder(s) representing sequencing runs which in turn contain fastq files (the location of one of the fastq files in one of the sequencing run folders is used as an input argument). A run is a group of results processed at the same time on the same machine representing the same molecular methods. All sequencing folders in the main directory need to represent data from sequencing runs that have used the same primers and protocols. Output from this function includes all processing files and final main output files in the form of fasta files and ASV tables.
## combine_dada_output()
The combine_dada_output() function uses DBTC ASV output files (‘YYYY_MM_DD_HH_MM_UserInputRunName_Merge’ and/or ‘YYYY_MM_DD_HH_MM_UserInputRunName_MergeFwdRev’) and combines them into a single ASV table with an accompanying fasta file. This function also produces a file containing the processing information for the function. The main input argument for this function is the location of a file in a folder containing all of the ASV tables to be combined. Output files are generated with the naming convention ‘YYYY_MM_DD_HH_MM_combinedDada’.
## make_BLAST_DB()
This function takes a fasta file with headers in the MACER format (Young et al. 2021) and establishes a database upon which a BLAST search can be completed. However, if an NCBI sequence database is desired, it is advisable to use, where applicable, NCBI preformatted databases and skip the make_BLAST_DB() function (https://www.ncbi.nlm.nih.gov/books/NBK62345/#blast_ftp_site.The_blastdb_subdirectory). The outcome of the function is a folder with a BLASTable NCBI formatted sequence database.

The MACER fasta header format: >UniqueID|OtherInformation|Genus|species|OtherInformation|Marker
## seq_BLAST()
[Fasta](https://en.wikipedia.org/wiki/FASTA_format) file(s) are used as input along with a user selected NCBI formatted database against which query sequences will be searched using BLAST. The outcome of the function is two files: a BLAST run file and a single file containing all of the BLAST results in tab delimited format. There are no headers in the BLAST results file but the columns (in order left to right) are: query sequence ID, search sequence ID, search taxonomic ID, query to sequence coverage, percent identity, search scientific name, search common name, query start, query end, search start, search end, e-value.
## taxon_assign()
This function takes a BLAST result file and associated [fasta](https://en.wikipedia.org/wiki/FASTA_format) files (either on their own or with accompanying [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) files generated from the dada_implement() function) and collapses the multiple BLAST results into a single result for each query sequence. When an [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) table is present the taxonomic results will be combined with the [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) table.
## combine_assign_output()
The combine_assign_output() function takes a file selection and then uses all DBTC [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) taxon assign files ('_taxaAssign_YYYY_MM_DD_HHMM.tsv') in a selected directory and combines them into a single output 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' file. The files being combined should represent different samples, but the data should all come from analyses using the same molecular methods, the same analysis arguments, and the same molecular sequence databases.
## reduce_taxa()
To reduce [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) results to unique taxa per file, the reduce_taxa() function takes a file selection and then uses all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' and/or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files in that directory. This function then reduces all [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) records with the same taxonomic assignment into a single result and places these results in a '_taxaReduced_YYYY_MM_DD_HHMM.tsv' file for each of the target files in the directory.
## combine_reduced_output()
This function takes a file selection and then uses all '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files in that directory and combines them into a single 'YYYY_MM_DD_HHMM_CombineTaxaReduced.tsv' taxa table file with presence-absence results. The files being combined should represent the same biological samples but with different molecular marker information. The output [ASV](https://en.wikipedia.org/wiki/Amplicon_sequence_variant) results can include read numbers or be reduced to presence-absence results.
([Back to Top](#table-of-contents))
***
# Naming convention rules
WARNING - NO WHITESPACE!
When running DBTC functions the paths for the files selected cannot have white space! File folder locations should be as short as possible (close to the [root](https://en.wikipedia.org/wiki/Root_directory) directory) as some functions do not process long naming conventions.
Also, special characters should be avoided (including question mark, number sign, exclamation mark). It is recommended that dashes be used for separations in naming conventions while retaining underscores for use as information delimiters (this is how DBTC functions use underscore).
There are several key character strings used in the DBTC pipeline; the presence of these strings in file or folder names will cause errors when running DBTC functions.
The following strings are those used in DBTC and should not be used in file or folder naming:
- _BLAST
- _combinedDada
- _taxaAssign
- _taxaAssignCombined
- _taxaReduced
- _CombineTaxaReduced
([Back to Top](#table-of-contents))
***
# Package Function Details
## Dada Implement
dada_implement() - Process metabarcode raw fastq files by [runs](#runs) using Dada2 (Note: molecular markers are independently analysed and combined at the end of the analysis pipeline using the [combine_reduced_output()](#combine-reduced-output) function).
### Input
Two inputs are required for the dada_implement() function. The first is the fastq files in the appropriate folder structure (see below) and the second is a file containing the primers used for the amplification of the sequence reads (a tab separated file; see [DBTCShinyTutorial](https://github.com/rgyoung6/DBTCShinyTutorial) for file examples). To select all of the fastq files, simply select one file in one of the [Run](#runs) directories to point to the desired data files, as long as the file and folder structure is correct.
**Fastq File Folder Structure**
```
Parent Directory
  |
  |
  -----------------
  |               |
  |               |
Run1 Directory   Run2 Directory
  - Fastq          - Fastq
  - Fastq          - Fastq
  ...              ...
```
**Format of the primer file**
| Forward | Reverse |
| :------------- |:-------------|
| AGTGTGTAGTGATTG | CGCATCGCTCAGACTGACTGC |
| GAGCCCTCGATCGCT | GGTCGATAGCTACGCGCGCATACGACT |
| | GGTTCACATCGCATTCAT |
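A primer file in this layout can be written from R with base functions; the sketch below (the file name and use of these example sequences are placeholders) is only illustrative:

```
# Build the two-column primer table; empty strings pad columns of unequal length
primers <- data.frame(
  Forward = c("AGTGTGTAGTGATTG", "GAGCCCTCGATCGCT", ""),
  Reverse = c("CGCATCGCTCAGACTGACTGC", "GGTCGATAGCTACGCGCGCATACGACT", "GGTTCACATCGCATTCAT")
)

# Write it out as a tab separated file for use with dada_implement()
write.table(primers, file = "primers.tsv", sep = "\t", row.names = FALSE, quote = FALSE)
```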
### Arguments
- <strong>runFolderLoc -</strong> Select a file in one of the [run](#runs) folders with the fastq files of interest (Default NULL).
- <strong>primerFile -</strong> Select a file with the primers for this analysis (Default NULL).
- <strong>fwdIdent -</strong> Forward identifier naming string (Default '_R1_001').
- <strong>revIdent -</strong> Reverse identifier naming string (Default '_R2_001').
- <strong>bidirectional -</strong> Selection to process paired forward and reverse sequence for analysis (Default TRUE).
- <strong>unidirectional -</strong> Selection to process files independently (Default FALSE).
- <strong>printQualityPdf -</strong> Selection to save image files showing quality metrics (Default TRUE).
- <strong>maxPrimeMis -</strong> Maximum number of mismatches allowed when pattern matching to trim the primers from the ends of the reads for the ShortRead trimLRPatterns() function (Default 2).
- <strong>fwdTrimLen -</strong> Select a forward trim length for the Dada filterAndTrim() function (Default 0).
- <strong>revTrimLen -</strong> Select a reverse trim length for the Dada filterAndTrim() function (Default 0).
- <strong>maxEEVal -</strong> Maximum number of expected errors allowed in a read for the Dada filterAndTrim() function (Default 2).
- <strong>truncQValue -</strong> Truncation value used to trim the ends of reads; nucleotides with quality values less than this value will be used to trim the remainder of the read for the Dada filterAndTrim() function (Default 2).
- <strong>truncLenValueF -</strong> Dada forward length trim value for the Dada filterAndTrim() function. This argument is set to 0 when the pattern matching trim is enabled (Default 0).
- <strong>truncLenValueR -</strong> Dada reverse length trim value for the Dada filterAndTrim() function. This argument is set to 0 when the pattern matching trim is enabled (Default 0).
- <strong>error -</strong> Percent of fastq files used to assess error rates for the Dada learnErrors() function (Default 0.1).
- <strong>nbases -</strong> The total number of bases used to assess errors for the Dada learnErrors() function (Default 1e80). NOTE: this value is set very high so that all nucleotides in the subset of fastq files selected for error assessment are used. If the error is to be assessed using total reads and not specific fastq files, then set error to 1 and set this value to the desired number of bases.
- <strong>maxMismatchValue -</strong> Maximum number of mismatches allowed when merging two reads for the Dada mergePairs() function (Default 2).
- <strong>minOverlapValue -</strong> Minimum number of overlapping nucleotides for the forward and reverse reads for the Dada mergePairs() function (Default 12).
- <strong>trimOverhang -</strong> Trim merged reads past the start of the complementary primer regions for the Dada mergePairs() function (Default FALSE).
- <strong>minFinalSeqLen -</strong> The minimum final desired length of the read (Default 100).
- <strong>verbose -</strong> If set to TRUE then there will be output to the R console, if FALSE then this reporting data is suppressed (Default TRUE).
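A hedged example call from the R console is shown below; the file paths are placeholders and only a subset of the arguments listed above is shown (the remaining arguments keep their defaults):

```
# Point to one fastq file inside a run folder and to the tab separated primer file
dada_implement(runFolderLoc = "Parent_Directory/Run1_Directory/sample_R1_001.fastq",
               primerFile = "Parent_Directory/primers.tsv",
               fwdIdent = "_R1_001",
               revIdent = "_R2_001",
               bidirectional = TRUE,
               maxPrimeMis = 2,
               maxEEVal = 2,
               minFinalSeqLen = 100)
```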
### Output
The output files from this function appear in four folders. See the below diagram for the saved file structure.
```
Parent Directory
  |
  |-- Run1 Directory
  |     - Fastq
  |     - Fastq
  |     ...
  |     |-- YYYY_MM_DD_HHMM_Run1_A_Qual
  |     |     - revQual.pdf
  |     |     ...
  |     |-- YYYY_MM_DD_HHMM_Run1_B_Filt
  |     |     - fwdFilt.fastq
  |     |     - revFilt.fastq
  |     |     ...
  |     |     |-- Primer_Trim
  |     |           - primeTrim.fastq
  |     |           ...
  |     |-- YYYY_MM_DD_HHMM_Run1_C_FiltQual
  |     |     - filtFwdQual.pdf
  |     |     - filtRevQual.pdf
  |     |     ...
  |     |-- YYYY_MM_DD_HHMM_Run1_D_Output
  |           - dadaSummary.txt
  |           - dadaSummaryTable.tsv
  |           - ErrorForward.pdf
  |           - ErrorReverse.pdf
  |           - MergeFwdRev.tsv
  |           - MergeFwdRev.fas
  |           - Merge.tsv
  |           - Merge.fas
  |           - TotalTable.tsv
  |
  |-- Run2 Directory
        - Fastq
        - Fastq
        ...
```
Quality pdfs in the A_Qual folder represent the quality metrics for the raw fastq files (this folder may not be present if printQualityPdf is set to FALSE). Files in the B_Filt folder represent the primer trimming results (in the Primer_Trim folder) and the trimmed and cleaned reads (in the main folder). Quality pdfs in the C_FiltQual folder represent the quality metrics for the trimmed and cleaned fastq files (this folder may not be present if printQualityPdf is set to FALSE). There are numerous files in the D_Output folder. These include:
- dadaSummary.txt and dadaSummaryTable.tsv
- ErrorForward.pdf and ErrorReverse.pdf
- MergeFwdRev.tsv and MergeFwdRev.fas
- Merge.tsv and Merge.fas
- TotalTable.tsv
([Back to Top](#table-of-contents))
***
## Combine Dada Output
combine_dada_output() - Combine Dada2 ASV tables (YYYY_MM_DD_HHMM_FileName_MergeFwdRev.tsv or YYYY_MM_DD_HHMM_FileName_Merge.tsv) into a single ASV output file.
Two or more files to be combined are required as input for this function. These files need to be ASV files as output from dada_implement() and can include Merge, MergeFwdRev, or TotalTable.tsv files.
The output from this function includes three files:
1. YYYY_MM_DD_HHMM_combinedDada.tsv - combined ASV table
2. YYYY_MM_DD_HHMM_combinedDada.fas - combined fasta file
3. YYYY_MM_DD_HHMM_combinedDada.txt - summary file from the combine_dada_output() run
Output data files will come in the same ASV table format as the dada_implement() ASV output files.
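A hedged example call is shown below; the argument name fileLoc is an assumption based on the description above (the location of a file in the folder holding the ASV tables to combine) and the path is a placeholder, so check ?combine_dada_output for the exact interface:

```
# Select any one of the ASV tables in the folder containing the tables to be combined
# (fileLoc is an assumed argument name used for illustration)
combine_dada_output(fileLoc = "Parent_Directory/Run1_Directory/YYYY_MM_DD_HHMM_Run1_D_Output/Merge.tsv")
```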
([Back to Top](#table-of-contents))
***
## Make BLAST DB
make_BLAST_DB() - Create a local database to BLAST against.
This function takes a fasta file (in MACER format) and establishes a database upon which a BLAST search can be completed. The outcome of the function is a folder with an NCBI formatted database.
- The MACER fasta header format - >UniqueID|OtherInformation|Genus|species|OtherInformation|Marker
- An example of the header format output from the MACER program is >GenBankAccessionOrBOLDID|GenBankAccession|Genus|species|UniqueID|Marker
The output from this function includes a folder with the BLAST database named according to the submitted dbName.
The constructed database can then be used with the seq_BLAST() function.
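A hedged example call is shown below; dbName is described above, while the other argument names (fileLoc, makeblastdbPath, taxaDBLoc) are assumptions used here for illustration only, so check ?make_BLAST_DB for the exact interface:

```
make_BLAST_DB(fileLoc = "CustomDB/MACER_formatted_records.fas",  # fasta file with MACER formatted headers (assumed argument name)
              makeblastdbPath = "makeblastdb",                   # path to the BLAST+ makeblastdb program (assumed argument name)
              taxaDBLoc = "Taxonomy/accessionTaxa.sql",          # taxonomizr database location (assumed argument name)
              dbName = "myCustomDB")                             # name used for the output database folder
```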
## Sequence BLAST
seq_BLAST() - Search Fasta files of unknown sequences against a BLAST formatted database.
Provide a location for the BLAST database you would like to use by selecting a file in the target directory (this can be a database built using the make_BLAST_DB() function or a preformatted NCBI BLAST database). Then provide the location of the query sequence files by indicating a file in a directory that contains the Fasta files. Provide the path for the blast+ blastn program. Finally, provide the minimum query sequence length to BLAST (Default = 100), the maximum depth of the returned BLAST results (Default = 200), and the number of cores used to process the function (Default = 1; the Windows implementation can only use a single core and will default to this value when running on Windows).
Two files are produced by this function for each of the Fasta files in the target directory: a BLAST run file and a BLAST results file.
The BLAST run file contains the command used to run the BLAST search. The BLAST results file includes all results in a tab delimited .tsv file format with the columns qseqid, sseqid, staxid, qcovs, pident, ssciname, scomname, qstart, qend, sstart, send, evalue.
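Because the BLAST results file has no header row, a minimal way to load it into R is to supply the documented column names yourself (the file name below is a placeholder):

```
# Column names follow the tab delimited output described above
blastCols <- c("qseqid", "sseqid", "staxid", "qcovs", "pident", "ssciname",
               "scomname", "qstart", "qend", "sstart", "send", "evalue")

blastRes <- read.delim("querySequences_BLAST_YYYY_MM_DD_HHMM.tsv",
                       header = FALSE, col.names = blastCols)
head(blastRes)
```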
## Taxon Assignment
taxon_assign() - Using BLAST results to construct a table with taxonomic assignments for each unique sequence.
This function requires a BLAST output file and an associated Fasta file. In addition, if present, an ASV file will also be combined with the taxonomic results. The function will take the BLAST results and reduce the taxonomic assignment to a single result for each read.
A single taxonomic assignment file is created for each BLAST output file, with a naming convention containing the string '_taxaAssign_YYYY_MM_DD_HHMM.tsv'. In addition, a run file is also generated which contains the run details, with the naming convention 'YYYY_MM_DD_HHMM_taxaAssign.txt'.
The number of returned BLAST results is dictated by the seq_BLAST() BLASTResults argument. The taxon_assign() function takes into account all returned BLAST results for each read. At each taxonomic level assignments have quality metrics in parentheses after the name. These values (“Num_Rec”, “Coverage”, “Identity”, “Max_eVal”) represent the number of records with this taxonomic placement, the minimum coverage and identity, and the maximum eValue for the reported taxa.
Column headers for the resulting taxonomic assignments include…
uniqueID, superkingdom, phylum, class, order, family, genus, species, Top_BLAST, Lowest_Single_Rank, Lowest_Single_Taxa, Lowest_Single_Rank_Above_Thres, Lowest_Single_Taxa_Above_Thres, Final_Common_Names, Final_Rank, Final_Taxa, Final_Rank_Taxa_Thres, Result_Code, Sequence, Length, Results, followed by columns of samples containing ASV values
There are three columns that deserve special explanation.
The Result_Code column contains flags placed on the results to better understand the quality of the resulting taxonomic assignments. Below is a list of codes:
Note: Records with BIRT and BCRT flags should be highly scrutinized. TBAT are also concerning in that they may represent a less specific taxonomic placement due to the moving of the result to a higher taxonomic placement. The taxonomic rank directly below the final reported rank should be reviewed as there may be potential to adjust the final taxonomic assignment. SANF results should be explored and the size of the database, the trust placed in the records in the database, and the depth of the BLAST results should be considered when assessing records with this flag. Records with SFAT are among the least concerning as the BLAST results were saturated but this taxonomic assignment saturation occurred above your set quality coverage and identity threshold. Concerns with records with this result could be that the depth of the BLAST analysis was not low enough for very large databases, or that the database is not complete (taxonomic breadth) when using smaller databases.
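Once a taxaAssign file has been generated, the Result_Code column can be used to pull out assignments that warrant closer review. The short sketch below (the file name is a placeholder) filters for the BIRT, BCRT, and TBAT flags discussed above:

```
# Read a taxon assignment output file and keep rows flagged for extra scrutiny
taxaAssign <- read.delim("mySample_taxaAssign_YYYY_MM_DD_HHMM.tsv")
flagged <- taxaAssign[grepl("BIRT|BCRT|TBAT", taxaAssign$Result_Code), ]
nrow(flagged)
```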
## Combine Assignment Output
combine_assign_output() - Using results from the taxon_assign() function, combines all files with the string '_taxaAssign_YYYY_MM_DD_HHMM.tsv' into a single .tsv file.
Select a file in a folder with the taxa assigned files you would like to combine (extension '_taxaAssign_YYYY_MM_DD_HHMM.tsv'). NOTE: all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files in the folder location should originate from the same dada output file but have outputs from different BLAST sequence libraries and therefore contain the same ASVs.
This function produces a ‘YYYY_MM_DD_HHMM_taxaAssignCombined.tsv’ and a ‘YYYY_MM_DD_HHMM_taxaAssignCombined.txt’ file in the selected target directory.
The interpretation of the combine_assign_output() 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' output files is the same as that of the taxon_assign() '_taxaAssign_YYYY_MM_DD_HHMM.tsv' files.
## Reduce Taxa
reduce_taxa() - Using results from taxon_assign() and/or Combine Assignment Output, this function combines all reads with the same taxonomic assignment into a single result for each of the files submitted.
This function requires the selection of a file in a directory; all '_taxaAssign_YYYY_MM_DD_HHMM.tsv' and/or 'YYYY_MM_DD_HHMM_taxaAssignCombined.tsv' files in that directory will be processed. All records with the same taxonomic result will be combined. The BLAST values in parentheses ("Num_Rec", "Coverage", "Identity", "Max_eVal") are combined as the mean number of records, the mean of the minimum coverage and identity values, and the mean of the maximum eValues.
This function produces a ’_taxaReduced_YYYY_MM_DD_HHMM.tsv’ file for every ’_taxaAssign_YYYY_MM_DD_HHMM.tsv’ or ‘YYYY_MM_DD_HHMM_taxaAssignCombined.tsv’ present in the target directory.
Reduced taxonomic assignment files have fewer columns in the main taxa_reduced.tsv file than the ’_taxaAssign_YYYY_MM_DD_HHMM.tsv’ or ‘YYYY_MM_DD_HHMM_taxaAssignCombined.tsv’ files as columns are collapsed. In addition, the values in the taxonomic columns in parentheses represent the average values across all of the results with the same taxonomic assignment (see taxon_assign() interpretation).
The columns include superkingdom, phylum, class, order, family, genus, species, Top_BLAST, Final_Common_Names, Final_Rank, Final_Taxa, Result_Code, RepSequence, Number_ASV, Average_ASV_Length, Number_Occurrences, Average_ASV_Per_Sample, Median_ASV_Per_Sample, Results. NOTE: The representative sequence (RepSequence) is the longest sequence for each of the collapsed taxa assigned.
## Combine Reduced Output
combine_reduced_output() - This function takes '_taxaReduced_YYYY_MM_DD_HHMM.tsv' files generated from the same biological samples but representing different amplified molecular markers and collapses these data into a single file. The outcome of this process is a presence-absence matrix for all taxa and markers.
Select a file in a folder with ’_taxaReduced_YYYY_MM_DD_HHMM.tsv’ files representing data for the same biological samples but representing different amplified molecular markers.
Two files, a 'YYYY_MM_DD_HHMM_CombineTaxaReduced.tsv' result file and a 'YYYY_MM_DD_HHMM_CombineTaxaReduced.txt' run summary file, are generated from this function. The result file contains presence-absence data in a matrix that associates the data with samples, taxa, and molecular markers. The column headers in the results file include the following: superkingdom, phylum, class, order, family, genus, species, markers (n columns), and samples (n columns).
The interpretation of the output file is the same as the taxon_assign() ’_taxaAssign_YYYY_MM_DD_HHMM.tsv’ files.
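A quick way to inspect the combined presence-absence table in R is shown below (the file name is a placeholder matching the naming convention above):

```
# Load the combined presence/absence matrix and check its column layout
combined <- read.delim("YYYY_MM_DD_HHMM_CombineTaxaReduced.tsv")
colnames(combined)  # taxonomic columns, then marker columns, then sample columns
head(combined)
```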
Young RG, et al., Hanner RH (2024) A Scalable, Open Source, Cross Platform, MetaBarcode Analysis Method using Dada2 and BLAST. Biodiversity Data Journal (In progress)