library(idopNetwork)
backup_options <- options()

#load pre-computed results
test_result = idopNetwork:::test_result
idopNetwork is packaged as a cartographic tool that performs power curve fitting, classification, variable selection, microbial abundance decomposition, and network visualization based on microbial 16S rRNA gene sequencing metadata.
For complete details on the use and execution of this protocol, please refer to Chen et al. and Cao et al.
Before running idopNetwork, users need to provide datasets that have been cleaned and merged into exactly the same format as the example data.
The microbial operational taxonomic unit (OTU) dataset must have the microbe IDs in its first column.
data("gut_microbe")
View(gut_microbe)
name | GQ_UC1 | GQ_UC3 | GQ_UC4 | GQ_UC5 |
---|---|---|---|---|
Escherichia_coli | 176 | 1701 | 28506 | 177 |
Arthrobacter_oxydans | 3 | 215 | 1578 | 1 |
Ruminococcus_gnavus | 744 | 0 | 4 | 733 |
Bacteroides_plebeius | 271 | 4401 | 7 | 8394 |
Bacteroides_pyogenes | 0 | 15 | 0 | 0 |
Clostridium_tertium | 305 | 1 | 1 | 270 |
Bacteroides_stercoris | 24 | 0 | 0 | 11 |
Flavisolibacter_ginsengisoli | 0 | 123 | 831 | 0 |
Bacteroides_massiliensis | 0 | 6098 | 6 | 0 |
Prevotella_heparinolytica | 0 | 6 | 0 | 0 |
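If your own OTU table is not yet in this shape, a minimal sketch of getting it there (the file name my_otu_counts.csv and the read.csv call are purely illustrative; only the layout matters):

my_otu <- read.csv("my_otu_counts.csv", check.names = FALSE) # hypothetical input file
colnames(my_otu)[1] <- "name" # first column must hold the microbe IDs, as in gut_microbe
str(my_otu) # remaining columns should be numeric per-sample counts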
data("mustard_microbe")
View(mustard_microbe)
 | ID | 8_0149.leaf | 8_0157.leaf | 8_0158.leaf | 8_0160.leaf |
---|---|---|---|---|---|
2 | OTU_1 | 3 | 2 | 59 | 268 |
3 | OTU_2 | 8 | 13 | 1890 | 44 |
4 | OTU_3 | 5485 | 3302 | 4522 | 6581 |
5 | OTU_4 | 42 | 351 | 7 | 4 |
6 | OTU_5 | 16 | 2002 | 24 | 43 |
7 | OTU_6 | 1645 | 1170 | 10636 | 31686 |
8 | OTU_7 | 4050 | 4094 | 13 | 66 |
9 | OTU_8 | 6 | 282 | 124 | 27 |
10 | OTU_9 | 102 | 5 | 10 | 18 |
11 | OTU_10 | 25 | 1393 | 35 | 19 |
(continuation of the same rows, showing further sample columns)

 | JAM_E4.leaf | 8_0149.root | 8_0157.root | 8_0158.root | 8_0160.root |
---|---|---|---|---|---|
2 | 7 | 28861 | 122 | 1 | 9 |
3 | 4 | 11618 | 19 | 14 | 1894 |
4 | 4981 | 481 | 755 | 17914 | 15998 |
5 | 622 | 862 | 1072 | 11 | 103 |
6 | 10014 | 447 | 758 | 19 | 289 |
7 | 577 | 1010 | 490 | 25030 | 29443 |
8 | 2233 | 13 | 5575 | 26 | 64 |
9 | 19 | 3348 | 36 | 33 | 46 |
10 | 362 | 6269 | 228 | 2 | 459 |
11 | 3071 | 4 | 6454 | 6 | 234 |
The first major step in our idopNetwork reconstruction is to fit allometric growth curves to the data using the power function. This is easily done with the function power_equation_fit, which takes the cleaned dataset as input and returns the fitted OTUs. The fitted output, together with the original dataset, can then be passed to power_equation_plot for quick visualization.
df = data_cleaning(gut_microbe)
result1 = power_equation_fit(df)
power_equation_plot(result1)
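To see what this fit does for a single OTU, here is a standalone nls() sketch (not the package internals); the log10 transform and the column-sum "habitat index" mirror the transforms used later in this vignette, and the starting values may need tuning or may fail to converge for some OTUs:

x <- as.numeric(log10(colSums(gut_microbe[-1]) + 1)) # per-sample habitat index from total abundance
y <- as.numeric(log10(unlist(gut_microbe[1, -1]) + 1)) # first OTU, log-transformed abundance
fit <- nls(y ~ a * x^b, start = list(a = 1, b = 1)) # power curve y = a * x^b
coef(fit)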
In this step we plug the power equation into functional clustering to detect different microbial modules. If, after clustering, a module still contains too many microbes for network reconstruction, we can rerun functional clustering to further classify this module into distinct submodules.
We fit the mean vector of each cluster center with the power equation (assume k = 3):
matplot(t(power_equation(x = 1:30, matrix(c(2,1,3,0.2,0.5,-0.5),nrow = 3, ncol = 2))),
type = "l",
xlab = "time",
ylab = "population")
legend("topright",
c("cluster 1", "cluster 2", "cluster 3"),
lty = c(1,2,3),
col = c(1,2,3),
box.lwd = 0)
We fit the covariance matrix of the multivariate normal distribution with the SAD1 structure, which can be shown as:
get_SAD1_covmatrix(c(2,0.5), n = 5)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.25 0.50 1.00 2.00 4.00
#> [2,] 0.50 1.25 2.50 5.00 10.00
#> [3,] 1.00 2.50 5.25 10.50 21.00
#> [4,] 2.00 5.00 10.50 21.25 42.50
#> [5,] 4.00 10.00 21.00 42.50 85.25
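The printed matrix follows the first-order structured antedependence (SAD1) pattern: variances accumulate along the diagonal and covariances scale geometrically away from it. A minimal sketch reproducing it by hand, assuming the first parameter is the antedependence coefficient phi and the second the innovation standard deviation gamma (and phi != 1):

sad1_by_hand <- function(par, n) {
  phi <- par[1]; gamma <- par[2]
  v <- gamma^2 * (phi^(2 * (1:n)) - 1) / (phi^2 - 1) # diagonal variances
  outer(1:n, 1:n, function(i, j) phi^abs(i - j) * v[pmin(i, j)]) # covariances
}
sad1_by_hand(c(2, 0.5), n = 5) # matches get_SAD1_covmatrix(c(2, 0.5), n = 5)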
We can check the initial parameters (k = 4):
get_par_int(X = log10(df+1), k = 4, times = as.numeric(log10(colSums(df)+1)))
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> $initial_cov_params
#> [1] 0.50000 1.09618
#>
#> $initial_mu_params
#> a b
#> [1,] 1.222888e+15 -21.7299072
#> [2,] 1.275768e+01 -0.8462041
#> [3,] 1.930472e+02 -3.6046396
#> [4,] 2.241205e-35 50.4617985
#>
#> $initial_probibality
#>
#> 1 2 3 4
#> 0.24590164 0.06557377 0.40983607 0.27868852
#use kmeans to get initial centers
tmp = kmeans(log10(df+1), 4)$centers
tmp2 = power_equation_fit(tmp, trans = NULL)
power_equation_plot(tmp2, label = NULL, n = 4)
idopNetwork already wraps the mean curve modelling, covariance matrix modelling, and likelihood ratio calculation into the function fun_clu().
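Conceptually, fun_clu() maximizes a k-component multivariate normal mixture likelihood whose cluster means are power curves and whose common covariance has the SAD1 structure. A rough sketch of that objective (the argument layout here is an assumption for illustration, not the package's internal code):

library(mvtnorm)
mixture_loglik <- function(X, times, mu_par, cov_par, prob) {
  mu <- power_equation(times, mu_par) # k x T matrix of cluster mean curves
  S <- get_SAD1_covmatrix(cov_par, n = length(times))
  dens <- sapply(seq_len(nrow(mu)),
                 function(k) dmvnorm(X, mean = mu[k, ], sigma = S))
  sum(log(dens %*% prob)) # log of sum_k prob_k * N(X_i; mu_k, S), summed over rows
}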
options(max.print = 10)
fun_clu(result1$original_data, k = 3, iter.max = 5)
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> Error in nls(y ~ a * x^b, start = list(a = a, b = b), control = nls.control(maxiter = 1000, :
#> number of iterations exceeded maximum of 1000
#> initial value 1225.809010
#> iter 10 value 1131.213445
#> final value 1125.868093
#> converged
#>
#> iter = 1
#> Log-Likelihood = 1125.868
#> initial value 1118.632904
#> iter 10 value 1115.215807
#> iter 10 value 1115.215807
#> iter 10 value 1115.215807
#> final value 1115.215807
#> converged
#>
#> iter = 2
#> Log-Likelihood = 1115.216
#> initial value 1113.154283
#> final value 1112.161477
#> converged
#>
#> iter = 3
#> Log-Likelihood = 1112.161
#> initial value 1111.441671
#> final value 1111.183070
#> converged
#>
#> iter = 4
#> Log-Likelihood = 1111.183
#> initial value 1110.923567
#> final value 1110.833860
#> converged
#>
#> iter = 5
#> Log-Likelihood = 1110.834
#> initial value 1110.730889
#> final value 1110.694611
#> converged
#>
#> iter = 6
#> Log-Likelihood = 1110.695
#> $cluster_number
#> [1] 3
#>
#> $Log_likelihodd
#> [1] 1110.695
#>
#> $AIC
#> [1] 2241.389
#>
#> $BIC
#> [1] 2260.101
#>
#> $cov_par
#> [1] 0.3550810 0.7415481
#>
#> $mu_par
#> [,1] [,2]
#> [1,] 4.803180e+01 -1.683863
#> [2,] 1.595905e+13 -19.044478
#> [3,] 4.142849e-26 36.857826
#>
#> $probibality
#> [1] 0.08309011 0.46813139 0.44877850
#>
#> $omega
#> [,1] [,2] [,3]
#> [1,] 3.154770e-10 9.998988e-01 1.011960e-04
#> [2,] 1.159313e-10 9.978411e-01 2.158879e-03
#> [3,] 9.560057e-18 9.398919e-01 6.010813e-02
#> [ reached getOption("max.print") -- omitted 45 rows ]
#>
#> $cluster
#> MC_UC3 MC_UC4 YZJC_UC3 GQ_UC3 ZC_UC3 MC_UC2 GQ_UC1 GQ_UC4 HJC_UC3 ZC_UC4
#> HJC_UC1 HJC_UC4 HC_UC2 ZC_UC1 HC_UC4 JJC_UC2 HJC_UC2 MC_UC1 GQ_UC5 ZC_UC5
#> apply(omega, 1, which.max)
#> [ reached 'max' / getOption("max.print") -- omitted 48 rows ]
#>
#> $cluster2
#> MC_UC3 MC_UC4 YZJC_UC3 GQ_UC3 ZC_UC3 MC_UC2 GQ_UC1 GQ_UC4 HJC_UC3 ZC_UC4
#> HJC_UC1 HJC_UC4 HC_UC2 ZC_UC1 HC_UC4 JJC_UC2 HJC_UC2 MC_UC1 GQ_UC5 ZC_UC5
#> apply(omega, 1, which.max)
#> [ reached 'max' / getOption("max.print") -- omitted 48 rows ]
#>
#> [ reached getOption("max.print") -- omitted 2 entries ]
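As a quick sanity check, the printed AIC and BIC are consistent with p = 10 free parameters (3 clusters x 2 mean parameters, plus 2 covariance parameters and 2 free mixing proportions) over n = 48 microbes, treating the reported value as a negative log-likelihood:

L <- 1110.695; p <- 10; n <- 48
2 * p + 2 * L # ~2241.39, matches the printed AIC
p * log(n) + 2 * L # ~2260.10, matches the printed BIC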
Usually we run the calculation for k = 2, ..., n with multiple threads and then decide the best k; fun_clu_BIC uses BIC to select the best cluster number by default.
result2 = fun_clu_parallel(result1$original_data, start = 2, end = 5)
best.k = which.min(sapply(result2, "[[", 'BIC')) + 1 #skipped k = 1
best.k
#> [1] 7
fun_clu_BIC(result = result2)
#we can directly give another k value
fun_clu_plot(result = result2, best.k = best.k)
In this part we select a subset of the mustard data for demonstration purposes.
data("mustard_microbe")
df2 = data_cleaning(mustard_microbe, x = 160)

res_l = power_equation_fit(df2[,1:5])
res_r = power_equation_fit(df2[,89:95])
res1 = data_match(result1 = res_l, result2 = res_r)

res2 = bifun_clu_parallel(data1 = res1$dataset1$original_data,
                          data2 = res1$dataset2$original_data,
                          Time1 = res1$dataset1$Time,
                          Time2 = res1$dataset2$Time,
                          start = 2,
                          end = 10,
                          thread = 9,
                          iter.max = 10)
fun_clu_BIC(result = res2)
#we can set best.k directly
bifun_clu_plot(result = res2, best.k = 3, color1 = "#C060A1", color2 = "#59C1BD")
Sometimes a module is still too large for network reconstruction (as judged by Dunbar's number); we can then further cluster it into sub-modules.
res3 = bifun_clu_convert(res2, best.k = 3)
large.module = order(sapply(res3$a$Module.all, nrow))[5]

res_suba = fun_clu_select(result_fit = res1$dataset1, result_funclu = res3$a, i = large.module)
res_subb = fun_clu_select(result_fit = res1$dataset2, result_funclu = res3$b, i = large.module)
dfsuba_l = power_equation_fit(res_suba$original_data)
dfsubb_r = power_equation_fit(res_subb$original_data)
ressub1 = data_match(result1 = dfsuba_l, result2 = dfsubb_r)
ressub2 = bifun_clu_parallel(data1 = ressub1$dataset1$original_data,
                             data2 = ressub1$dataset2$original_data,
                             Time1 = ressub1$dataset1$Time,
                             Time2 = ressub1$dataset2$Time,
                             start = 2,
                             end = 5,
                             iter.max = 3)

ressub2 = test_result$d2_subcluster
fun_clu_BIC(result = ressub2)
bifun_clu_plot(result = ressub2, best.k = 2, color1 = "#C060A1", color2 = "#59C1BD", degree = 1)
idopNetwork implements a LASSO-based procedure to choose a small set of the most significant microbes/modules that link with a given microbe/module. get_interaction() returns a compound list containing the target microbe/module name, the names of the most relevant modules/microbes, and the coefficients.
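The idea behind get_interaction() can be sketched directly with glmnet: regress one row (the target microbe/module observed across samples) on all other rows with an L1 penalty and keep the predictors with non-zero coefficients. The helper below is purely illustrative, not the package implementation:

library(glmnet)
lasso_select_sketch <- function(data, i) {
  y <- as.numeric(data[i, ]) # target microbe/module across samples
  x <- t(as.matrix(data[-i, ])) # samples x other microbes/modules
  cvfit <- cv.glmnet(x, y, alpha = 1) # alpha = 1 -> LASSO
  b <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1] # drop the intercept
  keep <- which(b != 0)
  list(ind.name = rownames(data)[i],
       dep.name = rownames(data)[-i][keep],
       coefficient = b[keep])
}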
result3 = fun_clu_convert(result2, best.k = best.k)
df.module = result3$original_data
get_interaction(df.module, 1)
#> $ind.name
#> [1] "M1"
#>
#> $dep.name
#> [1] "M5"
#>
#> $coefficient
#> [1] 5.951174
#we can check the microbial relationships in Module1
df.M1 = result3$Module.all$`1`
get_interaction(df.M1, 1)
#> Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
#> fold
#> Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
#> fold
#> Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
#> fold
#> $ind.name
#> [1] "Arthrobacter_oxydans"
#>
#> $dep.name
#> [1] "Fusobacterium_necrophorum"
#>
#> $coefficient
#> [1] 0.08263525
The qdODE system is built on the variable selection results. It has the unique ability to decompose the observed module/microbe abundance level into an independent component and a dependent component, which can be used for inferring the idopNetwork.
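To make the decomposition concrete, here is a toy two-node system solved with deSolve, splitting node 1's rate of change into an independent self-term and a dependent term contributed by node 2. The linear terms and parameter values are invented for illustration; the package estimates such effect curves from the data rather than fixing them:

library(deSolve)
toy_qdODE <- function(t, y, pars) {
  with(as.list(c(y, pars)), {
    ind1 <- a1 * y1 # independent component of node 1
    dep12 <- b12 * y2 # dependent component: effect of node 2 on node 1
    dy1 <- ind1 + dep12
    dy2 <- a2 * y2 # node 2 modelled with a self-term only here
    list(c(dy1, dy2), ind1 = ind1, dep12 = dep12)
  })
}
out <- ode(y = c(y1 = 1, y2 = 2), times = seq(1, 10, by = 0.5),
           func = toy_qdODE, parms = c(a1 = 0.1, b12 = -0.05, a2 = 0.08))
head(out) # columns include the decomposed ind1/dep12 trajectories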
options(max.print = 10)
# first we test solving a qdODE
module.relationship = lapply(1:best.k, function(c) get_interaction(df.module, c))
ode.test = qdODE_all(result = result3, relationship = module.relationship, 1, maxit = 100)
#> iter 10 value 0.004336
#> final value 0.000002
#> converged
# we can view the result
qdODE_plot_base(ode.test)
# then we solve all qdODEs
ode.module = qdODE_parallel(result3)
qdODE_plot_all(ode.module)
result_m1 = fun_clu_select(result_fit = result1, result_funclu = result3, i = 1)
ode.M1 = qdODE_parallel(result_m1)
qdODE_plot_all(ode.M1)
The final step of this guide is to visualize the multilayer network; our package provides the network_plot function to easily draw the idopNetwork. We simply plug the previous qdODE results into the network_conversion function, which converts them for network visualization.
net_module = lapply(ode.module$ode_result, network_conversion)
network_plot(net_module, title = "Module Network")

net_m1 = lapply(ode.M1$ode_result, network_conversion)
network_plot(net_m1, title = "M1 Network")

mustard_module_a = qdODE_parallel(res3$a)
mustard_module_b = qdODE_parallel(res3$b)

res_m1a = fun_clu_select(result_fit = res1$dataset1, result_funclu = res3$a, i = 3)
res_m1b = fun_clu_select(result_fit = res1$dataset2, result_funclu = res3$b, i = 3)
mustard_M1a = qdODE_parallel(res_m1a)
mustard_M1b = qdODE_parallel(res_m1b)

mustard_m_a <- lapply(mustard_module_a$ode_result, network_conversion)
mustard_m_b <- lapply(mustard_module_b$ode_result, network_conversion)
#set seed to make same random layout
layout(matrix(c(1,2),1,2,byrow=TRUE))
set.seed(1)
network_plot(mustard_m_a, title = "Module Network a")
set.seed(1)
network_plot(mustard_m_b, title = "Module Network b")
result_m1a = fun_clu_select(result_fit = res1$dataset1, result_funclu = res3$a, i = 1)
result_m1b = fun_clu_select(result_fit = res1$dataset2, result_funclu = res3$b, i = 1)
ode.m1a = qdODE_parallel(result_m1a, thread = 16)
ode.m1b = qdODE_parallel(result_m1b, thread = 16)

net_m1a = lapply(ode.m1a$ode_result, network_conversion)
net_m1b = lapply(ode.m1b$ode_result, network_conversion)
#set seed to make same random layout
layout(matrix(c(1,2),1,2,byrow=TRUE))
set.seed(1)
network_plot(net_m1a, title = "Module1 a")
set.seed(1)
network_plot(net_m1b, title = "Module1 b")
object “LL.next” not found

This happens when the parameter optimization fails; try rerunning the clustering.

Plot failure when using fun_clu_plot() or bifun_clu_plot()

This often happens when bad initial parameters are given and a cluster is lost; try rerunning the clustering or use a smaller k.
sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22621)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] idopNetwork_0.1.2
#>
#> loaded via a namespace (and not attached):
#> [1] deSolve_1.34 shape_1.4.6 tidyselect_1.2.0
#> [4] xfun_0.37 bslib_0.4.2 reshape2_1.4.4
#> [7] splines_4.2.1 lattice_0.20-45 colorspace_2.1-0
#> [10] vctrs_0.5.2 generics_0.1.3 htmltools_0.5.4
#> [13] yaml_2.3.7 utf8_1.2.3 survival_3.3-1
#> [16] rlang_1.0.6 orthopolynom_1.0-6.1 jquerylib_0.1.4
#> [19] pillar_1.8.1 withr_2.5.0 glue_1.6.2
#> [22] foreach_1.5.2 lifecycle_1.0.3 plyr_1.8.8
#> [25] stringr_1.5.0 munsell_0.5.0 gtable_0.3.1
#> [28] mvtnorm_1.1-3 codetools_0.2-18 evaluate_0.20
#> [31] labeling_0.4.2 knitr_1.42 fastmap_1.1.1
#> [34] parallel_4.2.1 fansi_1.0.4 highr_0.10
#> [37] Rcpp_1.0.10 polynom_1.4-1 scales_1.2.1
#> [40] cachem_1.0.7 jsonlite_1.8.4 farver_2.1.1
#> [43] ggplot2_3.4.1 digest_0.6.31 stringi_1.7.12
#> [46] dplyr_1.1.0 grid_4.2.1 cli_3.6.0
#> [49] tools_4.2.1 magrittr_2.0.3 sass_0.4.5
#> [52] glmnet_4.1-6 patchwork_1.1.2 tibble_3.1.8
#> [55] pkgconfig_2.0.3 Matrix_1.5-3 rmarkdown_2.20
#> [58] rstudioapi_0.14 iterators_1.0.14 R6_2.5.1
#> [61] igraph_1.4.1 compiler_4.2.1