#Introduction
MSiP is a computational approach for predicting protein-protein interactions (PPIs) from large-scale affinity purification mass spectrometry (AP-MS) data. It supports both the spoke and matrix models for interpreting AP-MS data in a network context. The “spoke” model considers only bait-prey interactions, whereas the “matrix” model assumes that every protein identified in a given AP-MS experiment (bait and preys) interacts with every other. The spoke model has a high false-negative rate, whereas the matrix model has a high false-positive rate. Although each model has merits on its own, combining them has been shown to improve the ability of machine learning classifiers to discriminate between true and false positive interactions.
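To make the difference between the two models concrete, here is a minimal base-R sketch that enumerates the pairs each model would score for a single toy purification. The protein identifiers are hypothetical, and the code only illustrates the pairing logic described above; it is not how the package itself builds its pairs.

```r
# Toy purification: one bait and three co-purified preys (hypothetical IDs).
bait  <- "BAIT1"
preys <- c("P1", "P2", "P3")

# Spoke model: only bait-prey pairs are considered.
spoke_pairs <- paste(bait, preys, sep = "~")

# Matrix model: every unordered pair among all identified proteins.
all_proteins <- c(bait, preys)
matrix_pairs <- combn(all_proteins, 2, FUN = paste, collapse = "~")

length(spoke_pairs)   # 3 bait-prey pairs
length(matrix_pairs)  # choose(4, 2) = 6 pairs
```

The extra prey-prey pairs are where the matrix model gains coverage and also where its additional false positives come from.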
###Load the package
library(MSiP)
###Sample Data Description:
A demo AP-MS proteomics dataset is provided with this package to illustrate the expected input data structure.
data("SampleDatInput")
head(SampleDatInput)
## Experiment.id Replicate Bait Prey counts
## 1: 14 BN2198 NSP5 Q07065 17
## 2: 23 BN2214 ORF7A O75947 3
## 3: 16 BN2173 NSP7 Q8TCU6 2
## 4: 7 BN2177 NSP11 P30153 2
## 5: 22 BN2186 ORF6 Q8NBM4 1
## 6: 4 BN2190 N P51571 1
###Scoring based on “spoke-model”:
Comparative Proteomic Analysis Software Suite (CompPASS) is a robust statistical scoring scheme for assigning scores to bait-prey interactions. The output from CompPASS scoring includes the Z-score, S-score, D-score, WD-score and other features. This function was optimized from the.
datScoring <-
cPASS(SampleDatInput)
head(datScoring)
## Bait Prey AvePSM TotalPSM Ratio_totalPSM Ratio S_score Z_score
## 1 E A5YKK6 4 4 1.0000000 0.03846154 10.198039 4.899136
## 2 E O15144 5 17 0.2941176 0.11538462 6.582806 2.344863
## 3 E O95299 1 1 1.0000000 0.03846154 5.099020 4.899136
## 4 E P04899 7 7 1.0000000 0.03846154 13.490738 4.899136
## 5 E P07355 1 1 1.0000000 0.03846154 5.099020 4.899136
## 6 E P07954 4 4 1.0000000 0.03846154 10.198039 4.899136
## D_score WD_score Entropy
## 1 10.198039 0.6014181 0
## 2 6.582806 0.2893455 0
## 3 5.099020 0.3007091 0
## 4 13.490738 0.7956014 0
## 5 5.099020 0.3007091 0
## 6 10.198039 0.6014181 0
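As a sanity check on how the columns relate, the sketch below recomputes the S-score from the `AvePSM` and `Ratio` columns of the rows printed above. The relationship `S = sqrt(AvePSM / Ratio)` is an assumption inferred from the displayed numbers (it matches the usual CompPASS S-score form if `Ratio` is the fraction of baits in which the prey was detected); it is not taken from the package source.

```r
# Rows 1, 2, 3 and 4 of the cPASS output shown above (values copied from the table).
demo <- data.frame(
  Prey    = c("A5YKK6", "O15144", "O95299", "P04899"),
  AvePSM  = c(4, 5, 1, 7),
  Ratio   = c(0.03846154, 0.11538462, 0.03846154, 0.03846154),
  S_score = c(10.198039, 6.582806, 5.099020, 13.490738)
)

# Assumed relationship: S = sqrt(average spectral count / fraction of baits with the prey).
demo$S_check <- sqrt(demo$AvePSM / demo$Ratio)

# Largest absolute difference between the recomputed and reported scores.
max(abs(demo$S_check - demo$S_score))
```

For these four rows the recomputed values agree with the reported `S_score` to within rounding of the printed inputs.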
###Scoring based on “matrix-model”:
The Dice coefficient has previously been applied to score interactions between all identified proteins (baits and preys) in a given AP-MS experiment.
datScoring <-
diceCoefficient(SampleDatInput)
head(datScoring)
## BPI Dice
## 1 A4D1P6~A5YKK6 0.0000000
## 2 A4D1P6~E 0.0000000
## 3 A4D1P6~E9PAV3 0.0000000
## 4 A4D1P6~M 0.0000000
## 5 A4D1P6~N 0.0000000
## 6 A4D1P6~NSP1 0.6666667
Alternatively, Jaccard, Simpson, and Overlap scores can be used to score the interaction between all the identified proteins in a given AP-MS experiment.
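As a rough guide to how these scores differ, the following base-R sketch computes all four for one pair of proteins from the sets of experiments in which each was detected. The experiment sets are hypothetical, chosen so that Dice comes out at 2/3, the value displayed for A4D1P6~NSP1. The Dice, Jaccard and Simpson formulas are the standard set-similarity definitions; the overlap formula `k^2 / (a * b)` is an assumption and may not match the package's exact implementation.

```r
# Experiments in which each of two hypothetical proteins was detected.
exp_A <- c(1)
exp_B <- c(1, 2)

k <- length(intersect(exp_A, exp_B))  # number of shared experiments
a <- length(exp_A)                    # experiments containing protein A
b <- length(exp_B)                    # experiments containing protein B

dice    <- 2 * k / (a + b)
jaccard <- k / length(union(exp_A, exp_B))
simpson <- k / min(a, b)
overlap <- k^2 / (a * b)              # assumed definition of the overlap score

round(c(Dice = dice, Jaccard = jaccard, Simpson = simpson, Overlap = overlap), 7)
```

Simpson reaches 1 whenever the rarer protein is always seen with the other one, which is why it can be much larger than Jaccard or Dice for the same pair.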
#Jaccard coefficient
datScoring <-
jaccardCoefficient(SampleDatInput)
head(datScoring)
## BPI Jaccard
## 1 A4D1P6~A5YKK6 0.0
## 2 A4D1P6~E 0.0
## 3 A4D1P6~E9PAV3 0.0
## 4 A4D1P6~M 0.0
## 5 A4D1P6~N 0.0
## 6 A4D1P6~NSP1 0.5
#Simpson coefficient
datScoring <-
simpsonCoefficient(SampleDatInput)
head(datScoring)
## BPI Simpson
## 1 A4D1P6~A5YKK6 0
## 2 A4D1P6~E 0
## 3 A4D1P6~E9PAV3 0
## 4 A4D1P6~M 0
## 5 A4D1P6~N 0
## 6 A4D1P6~NSP1 1
#Overlap score
datScoring <-
overlapScore(SampleDatInput)
head(datScoring)
Finally, a weighted matrix model can also be employed to score interactions between identified proteins in a given AP-MS experiment. Its output includes the number of experiments in which the pair of proteins was co-purified (i.e., k) and the negative log of the hypergeometric-test P-value, -1*log(P-value) (i.e., logHG), computed from the experimental overlap, each protein's total number of observed experiments, and the total number of experiments.
datScoring <-
Weighted.matrixModel(SampleDatInput)
## Joining, by = "UniprotID"
## Joining, by = "UniprotID"
head(datScoring)
## BPI k logHG
## 1 A4D1P6~Q12931 2 4.762174
## 2 O00160~P53396 2 3.690590
## 3 O00160~Q9UBT2 2 4.762174
## 4 O00231~P28331 2 3.558201
## 5 O15144~Q9P2D7 2 3.690590
## 6 O43776~P12004 2 4.762174
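The hypergeometric score described above can be sketched in base R with `phyper`. The counts below are hypothetical and not taken from the sample data, and the use of the natural logarithm and of an upper-tail P-value are assumptions; the package's exact parameterization may differ.

```r
# Hypothetical counts, not taken from the sample data.
N   <- 10  # total number of experiments
n_A <- 2   # experiments in which protein A was observed
n_B <- 2   # experiments in which protein B was observed
k   <- 2   # experiments in which both were observed (the overlap)

# P(overlap >= k) if the two proteins were distributed independently
# across experiments (upper tail of the hypergeometric distribution).
p_value <- phyper(k - 1, n_A, N - n_A, n_B, lower.tail = FALSE)

# Negative log P-value, assuming the natural logarithm.
logHG <- -log(p_value)
c(k = k, p_value = p_value, logHG = logHG)
```

A larger logHG means the observed co-purification is less likely to arise by chance given how often each protein is seen on its own.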
###Assign a confidence score to each instance using classifiers:
The labeled feature matrix can be used as input for Support Vector Machine (SVM) or Random Forest (RF) classifiers. The classifier then assigns each bait-prey pair a confidence score indicating the level of support for that pair of proteins interacting. Hyperparameter optimization can also be performed to select the set of parameters that maximizes the model's performance. The RF and SVM functions provided in this package also compute the areas under the precision-recall (PR) and ROC curves to evaluate the performance of the classifier.
####Import the demo data:
data("testdfClassifier")
head(testdfClassifier)
## BPI Dice Jaccard Overlap Simpson k logHG target
## 1 O00232~P35998 0.3478261 0.2105263 0.2105263 1.0000000 4 1.5102503 1
## 2 P18077~P42766 0.7272727 0.5714286 0.5538462 0.9230769 12 3.0401840 1
## 3 Q08722~Q96N66 0.6153846 0.4444444 0.4444444 1.0000000 8 3.9266218 0
## 4 O43684~P42224 0.6363636 0.4666667 0.4083333 0.7000000 7 3.0103311 0
## 5 Q13838~Q14498 0.5882353 0.4166667 0.3571429 0.7142857 5 3.1524199 1
## 6 P22314~P61224 0.3333333 0.2000000 0.1142857 0.4000000 2 0.9471408 0
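The demo feature matrix above has one row per BPI, one column per score, and a `target` label. A minimal sketch of how such a table might be assembled from separate score tables is shown here; the data frames, their values, and the labels are all made up for illustration, and only base-R `merge` and `Reduce` are used.

```r
# Hypothetical per-metric score tables keyed by BPI (values are made up).
dice_df    <- data.frame(BPI = c("A~B", "A~C", "B~C"), Dice = c(0.8, 0.1, 0.5))
jaccard_df <- data.frame(BPI = c("A~B", "A~C", "B~C"), Jaccard = c(0.67, 0.05, 0.33))

# Hypothetical gold-standard labels: 1 = known interaction, 0 = assumed non-interaction.
labels_df  <- data.frame(BPI = c("A~B", "A~C"), target = c(1, 0))

# Join the score tables on BPI, then keep only the pairs that have a label.
features <- Reduce(function(x, y) merge(x, y, by = "BPI"), list(dice_df, jaccard_df))
train_df <- merge(features, labels_df, by = "BPI")
train_df
```

Pairs without a label (here `B~C`) drop out of the training table, so they would need to be scored separately once a classifier has been trained.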
####Run the RF classifier:
## named numeric(0)
####Output from RF classifier:
#positive score corresponds to the level of support for the pair of proteins to be true positive
#negative score corresponds to the level of support for the pair of proteins to be true negative
head(predidcted_RF)
## fulldat[, 1] Positive Negative
## 1 O00232~P35998 0.49303217 0.50696783
## 2 P18077~P42766 0.77920137 0.22079863
## 3 Q08722~Q96N66 0.22850952 0.77149048
## 4 O43684~P42224 0.33594286 0.66405714
## 5 Q13838~Q14498 0.90883333 0.09116667
## 6 P22314~P61224 0.06761905 0.93238095
####Run the SVM classifier:
#only generate the ROC curve
predidcted_SVM <-
  svmTrain(testdfClassifier, impute = FALSE, p = 0.3, parameterTuning = FALSE,
           cost = seq(from = 2, to = 10, by = 2),
           gamma = seq(from = 0.01, to = 0.10, by = 0.02),
           kernel = "radial", ncross = 10,
           pr.plot = FALSE, roc.plot = TRUE
  )
## named numeric(0)
## Setting levels: control = Positive, case = Negative
## Setting direction: controls < cases
####Output from SVM classifier:
#positive score corresponds to the level of support for the pair of proteins to be true positive
#negative score corresponds to the level of support for the pair of proteins to be true negative
head(predidcted_SVM)
## Positive Negative
## 1 "O00232~P35998" "0.40159000790017" "0.59840999209983"
## 2 "P18077~P42766" "0.540179276612381" "0.459820723387619"
## 3 "Q08722~Q96N66" "0.403983380113873" "0.596016619886127"
## 4 "O43684~P42224" "0.402883087812647" "0.597116912187353"
## 5 "Q13838~Q14498" "0.399390389293187" "0.600609610706813"
## 6 "P22314~P61224" "0.39057315208806" "0.60942684791194"
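To show how such scores relate to the ROC-based evaluation mentioned earlier, the sketch below computes a rank-based (Mann-Whitney) estimate of the area under the ROC curve from the six rows printed in this document. The `auc` helper is hypothetical and written for illustration; it is not the package's own evaluation code, and six rows are far too few to say anything about real performance. Because the SVM scores are printed as quoted strings, they are converted with `as.numeric` first.

```r
# Labels from head(testdfClassifier) and Positive scores copied from the outputs above.
target <- c(1, 1, 0, 0, 1, 0)
rf_pos <- c(0.49303217, 0.77920137, 0.22850952, 0.33594286, 0.90883333, 0.06761905)

# The SVM scores are character strings in the printed output, so convert them.
svm_pos <- as.numeric(c("0.40159000790017", "0.540179276612381", "0.403983380113873",
                        "0.402883087812647", "0.399390389293187", "0.39057315208806"))

# Rank-based estimate of the area under the ROC curve: the fraction of
# (positive, negative) pairs in which the positive example scores higher.
auc <- function(score, label) {
  r     <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc(rf_pos, target)
auc(svm_pos, target)
```

On these six rows the RF scores rank every labeled positive above every negative (AUC of 1), while the SVM scores order only 5 of the 9 positive-negative pairs correctly.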