By default ampir
predicts the probability of a protein to be an antimicrobial peptide (AMP) or not based on a trained SVM model with as input known AMP sequences corresponding to a wide diversity of organisms. However, within the predict_amps
function there is a model
argument that allows users to pass their own trained model object. Using a different trained model might be useful when users wish to e.g. use a taxonomic specific model to predict AMPs in a restricted group of taxa.
This vignette will go through a mock example of how you can train your own model using the caret
package. For more information on how to use caret
and the functions used within this example, please see the extensive documentation made by the author, Max Kuhn.
First, a positive and negative dataset have to be obtained. In this example, we want to predict AMPs in bats and decide to train a model using protein sequences found in bats. The positive dataset are AMPs and the negative dataset are random sequences. Both datasets were obtained from UniProt:
For the positive dataset:
library(ampir)
<- read_faa(system.file("extdata/bat_positive.fasta.gz", package = "ampir"))
bat_pos $Label <- "Positive"
bat_pos<- remove_nonstandard_aa(bat_pos) bat_pos
For the negative dataset:
<- read_faa(system.file("extdata/bat_negative.fasta.gz", package = "ampir"))
bat_neg $Label <- "Negative"
bat_neg<- remove_nonstandard_aa(bat_neg)
bat_neg <- bat_neg[!bat_neg$seq_aa %in% bat_pos$seq_aa,]
bat_neg <- bat_neg[sample(nrow(bat_neg),78),] bat_neg
Combine the positive and negative dataset
<- rbind(bat_pos, bat_neg) bats
Calculate features on the combined positive and negative dataset and add the label column
<- calculate_features(bats)
bats_features $Label <- as.factor(bats$Label)
bats_featuresrownames(bats_features) <- NULL
library(caret)
Split feature set data and create train and test set with caret
<-createDataPartition(y=bats_features$Label, p=.7, list = FALSE)
trainIndex <-bats_features[trainIndex,]
bats_featuresTrain <-bats_features[-trainIndex,] bats_featuresTest
Resample method using repeated cross validation and adding in a probability calculation with caret
<- trainControl(method = "repeatedcv", number = 10, repeats = 3,
trctrl_prob classProbs = TRUE)
Train model using a support vector machine with radial kernel with caret
. Note: Other classification models are supported too. For example, to use a random forest model in caret
, method
could be changed from “svmRadial” to “ranger”.
<- train(Label~.,
my_bat_svm_model data = bats_featuresTrain[,-1], # excluding seq_name column
method="svmRadial",
trControl = trctrl_prob,
preProcess = c("center", "scale"))
Test model to get an indication of how well the model performs on test data with caret
<- predict(my_bat_svm_model, bats_featuresTest)
my_bat_pred <- confusionMatrix(my_bat_pred, bats_featuresTest$Label, positive = "Positive") cm
Subset from cm$byClass
Balanced Accuracy | Precision | Recall | F1 |
---|---|---|---|
0.98 | 0.96 | 1 | 0.98 |
Convert the bat feature test data to the original FASTA type format containing just the sequence name and sequence as this is the required input data for ampir
<- bats[bats$seq_name %in% bats_featuresTest$seq_name,][,-3]
bat_test_set rownames(bat_test_set) <- NULL
Use the trained bat model in ampir
’s predict_amps
function on the bat test set
<- predict_amps(bat_test_set, min_len = 5, model = my_bat_svm_model) my_bat_AMPs
my_bat_AMPs
sample
seq_name | seq_aa | prob_AMP | |
---|---|---|---|
1 | G1PHN9… | MKIYYYLLHF… | 0.996 |
2 | G1P1T7… | MKALLTLGLL… | 0.984 |
3 | G1QAP4… | MKALLTLGLL… | 0.981 |
44 | Q95727… | MTNIRKTHPL… | 0.018 |
45 | P14391… | VHLSGEEKAA… | 0.025 |
46 | Q330H3… | MNPYIFFIIM… | 0.024 |