Welcome to PheVis package, an unsupervised R package for phenotyping at visit resolution. I will briefly explain how it works, for more details about the underlying method. The detailed method is available on MedRxiv (doi: https://doi.org/10.1101/2020.06.15.20131458)
First we need a toy dataset, luckily, there is one provided inside the package. Here are the first rows and columns of the dataset.
library(PheVis)
library(dplyr)
library(knitr)
library(ggplot2)
data("data_phevis")
kable(head(data_phevis[,1:7]))
subject | time | PR_state | mainICD | mainCUI | var1 | var2 |
---|---|---|---|---|---|---|
1 | 2000-01-11 | 1 | 1 | 12 | 10 | 0 |
1 | 2000-02-06 | 1 | 1 | 2 | 9 | 1 |
1 | 2000-02-08 | 1 | 0 | 6 | 8 | 1 |
1 | 2000-02-23 | 1 | 0 | 6 | 10 | 1 |
1 | 2000-02-25 | 1 | 1 | 11 | 7 | 1 |
1 | 2000-02-27 | 1 | 2 | 10 | 11 | 2 |
We are going to split it in train and test set, add an
ENCOUNTER_NUM
column (one identifier by visit) and convert
date to numeric format.
<- data_phevis %>%
df mutate(ENCOUNTER_NUM = row_number(),
time = round(as.numeric(time)))
set.seed(1)
<- 0.8*length(unique(df$subject))
trainsize <- sample(x = unique(df$subject), size = trainsize)
trainid <- unique(df$subject)[!unique(df$subject) %in% trainid]
testid
<- as.data.frame(df[df$subject %in% trainid,])
df_train <- as.data.frame(df[df$subject %in% testid,]) df_test
To train PheVis we first should set the parameters that are used by the algorithm:
var_vec
main_icd
and main_cui
GS
half_life
<- c(paste0("var",1:10), "mainCUI", "mainICD")
var_vec <- "mainICD"
main_icd <- "mainCUI"
main_cui <- "PR_state"
GS <- Inf half_life
Now we train the model, it might take a while.
<- PheVis::train_phevis(half_life = half_life,
train_model df = df_train,
START_DATE = "time",
PATIENT_NUM = "subject",
ENCOUNTER_NUM = "ENCOUNTER_NUM",
var_vec = var_vec,
main_icd = main_icd,
main_cui = main_cui)
#> -- Arguments passed check -> PheVis begins --
#> Joining with `by = join_by(id, id_encounter, var, start_date)`
#> Joining with `by = join_by(PATIENT_NUM, ENCOUNTER_NUM)`
#> Joining with `by = join_by(START_DATE, PATIENT_NUM, ENCOUNTER_NUM)`
Now that we have trained the model, we are going to use it to get probabilities prediction for the test set. It also takes a while as the cumulated variables must be computed for the test dataset too.
<- PheVis::test_phevis(train_param = train_model$train_param,
test_model df_test = df_test,
START_DATE = "time",
PATIENT_NUM = "subject",
ENCOUNTER_NUM = "ENCOUNTER_NUM",
surparam = train_model$surparam,
model = train_model$model)
#> -- Arguments passed check -> PheVis begins --
#> Joining with `by = join_by(id, id_encounter, var, start_date)`
#> Joining with `by = join_by(PATIENT_NUM, ENCOUNTER_NUM)`
#> Joining with `by = join_by(START_DATE, PATIENT_NUM, ENCOUNTER_NUM)`
We can access the different components of the returned objects.
train_model
is a list containing multiples objects. We have
surparam
, the parameters computed inside PheVis to compute
the surrogate (mean and sd of main surrogates).
$surparam
train_model#> $mean_icd
#> [1] 0.5073889
#>
#> $sd_icd
#> [1] 0.8720246
#>
#> $mean_nlp
#> [1] 4.532568
#>
#> $sd_nlp
#> [1] 4.554872
#>
#> $min_all
#> [1] -1.576955
model
which contains the fixed effect of the model and
the type of model trained (usually glmer for mixed effect logistic
regression but might be glm for logistic regression if mixed model fails
to converge).
$model
train_model#> $fixedEffect
#> (Intercept) mainCUI mainICD var1 var4
#> -31.829323195 -0.041409221 -0.017401159 0.037806247 -0.015078210
#> var9 var10 var1_cum var4_cum var9_cum
#> 0.082143744 -0.004552059 0.083685829 0.052506165 0.050956436
#> var10_cum mainCUI_cum mainICD_cum last_vis last_5vis
#> 0.097538021 0.067701132 0.312671148 -0.102492199 0.303248468
#> cum cum_month cum_year
#> 0.190078954 0.427306844 0.131637638
#>
#> $model
#> [1] "glmer"
df_train_result
the data.frame with the surrogates
(qualitative and quantitative), the output probability and the visit
id.
head(train_model$df_train_result)
#> ENCOUNTER_NUM PRED SUR_QUANTI SUR_QUALI
#> 1 1 1.593243e-11 3.781298 3
#> 2 2 1.777235e-09 5.367145 3
#> 3 3 1.072111e-07 6.684416 3
#> 4 4 1.422840e-06 8.001687 3
#> 5 5 2.292557e-04 11.563440 3
#> 6 6 5.327203e-01 16.052404 3
train_param
corresponds to the hyperparameters of PheVis
chosen by the user.
$train_param
train_model#> $half_life
#> [1] Inf
#>
#> $omega
#> [1] 2
#>
#> $var_vec
#> [1] "var1" "var2" "var3" "var4" "var5" "var6" "var7"
#> [8] "var8" "var9" "var10" "mainCUI" "mainICD"
#>
#> $main_icd
#> [1] "mainICD"
#>
#> $main_cui
#> [1] "mainCUI"
#>
#> $GS
#> NULL
df_x_train
is the final data.frame to predict the
probability with the cumulated features.
head(train_model$df_x_train[,c(1:2, 14:15, 29:30)])
#> var1 var2 var1_cum var2_cum cum_month cum_year
#> 1 10 0 10 0 3.781298 3.781298
#> 2 9 1 19 1 5.367145 5.367145
#> 3 8 1 27 2 6.684416 6.684416
#> 4 10 1 37 3 4.220388 8.001687
#> 5 7 1 44 4 7.782141 11.563440
#> 6 11 2 55 6 12.271106 16.052404
test_model
is also a list containing two data.frame. We
have df_result
with the predictions of the model.
head(test_model$df_result)
#> PATIENT_NUM ENCOUNTER_NUM START_DATE SUR_QUANTI SUR_QUALI PREDICTION
#> 1 18 347 10994 3.512722 3 2.592243e-12
#> 2 18 348 10998 6.854930 3 1.905788e-09
#> 3 18 349 11001 8.391746 3 3.516195e-07
#> 4 18 350 11008 9.928562 3 1.445644e-05
#> 5 18 351 11012 11.245832 3 4.459431e-04
#> 6 18 352 11014 14.539009 3 8.700905e-02
And df_pred
the data.frame with the model
predictions.
head(test_model$df_pred[,c(1:2, 14:15, 29:30)])
#> var1 var2 PATIENT_NUM ENCOUNTER_NUM last_5vis cum
#> 1 6 3 18 347 0.000000 3.512722
#> 2 4 2 18 348 3.512722 6.854930
#> 3 7 2 18 349 6.854930 8.391746
#> 4 4 1 18 350 8.391746 9.928562
#> 5 10 0 18 351 9.928562 11.245832
#> 6 3 2 18 352 11.245832 14.539009
We can display a graph with the prediction and the gold-standard with
the function ggindividual_plot
.
<- test_model$df_result %>%
df_plot left_join(df_test) %>%
filter(PATIENT_NUM %in% c(18, 23, 26, 32))
#> Joining with `by = join_by(ENCOUNTER_NUM)`
::ggindividual_plot(subject = df_plot$PATIENT_NUM,
PheVistime = df_plot$START_DATE,
gold_standard = df_plot$PR_state,
prediction = df_plot$PREDICTION)
Now we can see the performance of the model using ROC cure and Precision Recall (PR) curve
<-PRROC::pr.curve(scores.class0 = test_model$df_result$PREDICTION,
pr_curve weights.class0 = df_test$PR_state,
curve = TRUE)
plot(pr_curve)
<- PRROC::roc.curve(scores.class0 = test_model$df_result$PREDICTION,
roc_curve weights.class0 = df_test$PR_state,
curve = TRUE)
plot(roc_curve)