Multivariate tools for compositional data analysis: the ToolsForCoDA package

Introduction

The ToolsForCoDa package was originally created in order to provide functions for a canonical correlation analysis with compositional data (Graffelman et al. (2018)), based on the centred logratio (clr) transformation of the compositions. Posteriorly, it has been extended with additional tools for the multivariate analysis of compositional data in the R environment. Currently, this package (version 1.0.9) provides functionality for

log-ratio principal component analysis (LR-PCA).
log-ratio canonical correlation analysis (LR-CCO).
log-ratio discriminant analysis (LR-LDA).

Both CCO and LDA rely on the inversion of a covariance matrix. The covariance matrix of the clr transformed compostions is structurally singular. The programs lrcco and lrlda resolve this with the use of a generalized inverse. Functionality for the analysis of compositional data in the R environment can be found in the packages compositions (Boogaart, Tolosana-Delgado, and Bren (2024)), robCompositions (Templ, Hron, and Filzmoser (2024)), easyCODA (Greenacre (2024)) and zCompositions (Palarea-Albaladejo and Martin-Fernandez (2024)). For further reading on compositional data, see the seminal text on compositional data by Aitchison (Aitchison (1986)) and several recent statistical textbooks (Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado (2015), Filzmoser, Hron, and Templ (2018), Greenacre (2018), Boogaart and Tolosana-Delgado (2013)).

The remainder of this vignette shows an example session showing how to perform the three aforementioned types of analysis.

Installation
LR-PCA
LR-CCO
LR-LDA

1. Installation

#install.packages("ToolsForCoDa")
library(calibrate)
library(Correlplot)
library(ToolsForCoDa)

2. Logratio principal component analysis (LR-PCA)

We consider the composition of 37 Pinot Noir samples, consisting of the concentrations of Cd, Mo, Mn, Ni, Cu, Al, Ba, Cr, Sr, Pb, B, Mg, Si, Na, Ca, P, K and an evaluation of the wine’s aroma. (Frank and Kowalski (1984)).

data(PinotNoir)
head(PinotNoir)
#>      Cd    Mo    Mn    Ni    Cu    Al    Ba    Cr    Sr    Pb    B  Mg   Si
#> 1 0.005 0.044 1.510 0.122 0.830 0.982 0.387 0.029 1.230 0.561 2.63 128 17.3
#> 2 0.055 0.160 1.160 0.149 0.066 1.020 0.312 0.038 0.975 0.697 6.21 193 19.7
#> 3 0.056 0.146 1.100 0.088 0.643 1.290 0.308 0.035 1.140 0.730 3.05 127 15.8
#> 4 0.063 0.191 0.959 0.380 0.133 1.050 0.165 0.036 0.927 0.796 2.57 112 13.4
#> 5 0.011 0.363 1.380 0.160 0.051 1.320 0.380 0.059 1.130 1.730 3.07 138 16.7
#> 6 0.050 0.106 1.250 0.114 0.055 1.270 0.275 0.019 1.050 0.491 6.56 172 18.7
#>     Na    Ca   P    K Aroma
#> 1 66.8  80.5 150 1130   3.3
#> 2 53.3  75.0 118 1010   4.4
#> 3 35.4  91.0 161 1160   3.9
#> 4 27.5  93.6 120  924   3.9
#> 5 76.6  84.6 164 1090   5.6
#> 6 15.7 112.0 137 1290   4.6
Aroma <- PinotNoir[,c("Aroma")]

We apply closure to the chemical concentrations by division by their total, and use lrpca to do perform LR-PCA.

X  <- as.matrix(PinotNoir[,1:17])
Xc <- X/rowSums(X)
out.lrpca <- lrpca(Xc)

We study the decomposition of compositional variance, and the decay of the LR-PCA eigenvalues by means of a screeplot

round(out.lrpca$decom,2)
#>      PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9 PC10 PC11 PC12 PC13 PC14 PC15
#> la  1.44 1.00 0.59 0.47 0.25 0.25 0.18 0.14 0.12 0.07 0.04 0.04 0.02 0.01 0.01
#> laf 0.31 0.22 0.13 0.10 0.05 0.05 0.04 0.03 0.03 0.01 0.01 0.01 0.00 0.00 0.00
#> lac 0.31 0.53 0.66 0.76 0.81 0.87 0.90 0.93 0.96 0.97 0.98 0.99 1.00 1.00 1.00
#>     PC16 PC17
#> la     0    0
#> laf    0    0
#> lac    1    1
plot(1:ncol(out.lrpca$decom),
     out.lrpca$decom[1,],type="b",main="Scree-plot",
     xlab="PC",ylab="Eigenvalue")

We construct a covariance biplot, using jointlim to establish sensible limits for the x and y axes. Column markers for the clr transformed variables are multiplied by a constant (2.5) for a better visualization, and the amount of explained variance is indicated on the coordinate axes.

lims <- jointlim(out.lrpca$Fs,2.5*out.lrpca$Gp)
bplot(out.lrpca$Fs,2.5*out.lrpca$Gp,rowlab="",collab=colnames(Xc),rowch=1,colch=NA,
      xl=lims$xlim,yl=lims$ylim,main="Covariance biplot")

pc1lab <- paste("PC1 (",toString(round(100*out.lrpca$decom[2,1],1)),"%)",sep="")
pc2lab <- paste("PC2 (",toString(round(100*out.lrpca$decom[2,2],1)),"%)",sep="")

text(1,-0.1,pc1lab,cex=0.75)
text(0.1,1.5,pc2lab,cex=0.75,srt=90)

This biplot reveals that the logratio \(\ln{(Na/Pb)}\) has a large variance and is tightly correlated to the first principal component.

The variable Aroma correlates with the first principal components

cor(Aroma,out.lrpca$Fs[,1:2])
#>            PC1        PC2
#> [1,] 0.4128395 -0.4117683

and as the biplot suggests, Aroma correlates positively with the logratio \(\ln{(Cr/Sr)}\)

logScSr <- log(Xc[,c("Cr")]/Xc[,c("Sr")])
cor(Aroma,logScSr)
#> [1] 0.6469465

We note function lrpca also calculates condition indices, which may prove useful for detecting proportionality or one-dimensional relationships (Graffelman (2021)).

3. Logratio canonical correlation analysys (LR-CCO)

Two examples of LR-CCO are given below. The first example concerns a small artificial data set, where both the X and Y set are compositional, and is described in Section 3.1 of Graffelman et al. (2018). The second example concerns major oxides compositions of bentonites, where the X set is compositional and Y set is not.

3.1 Artificial data

We first load two artificial 3-part compositions.

data("Artificial")
Xsim.com <- Artificial$Xsim.com
Ysim.com <- Artificial$Ysim.com
colnames(Xsim.com) <- paste("X",1:3,sep="")
colnames(Ysim.com) <- paste("Y",1:3,sep="")

We make the ternary diagrams of the two sets of compositions

opar <- par(mfrow=c(1,2),mar=c(2,2,1,0)+0.5,pty="s")
par(mfg=c(1,1))
  ternaryplot(Xsim.com,pch=1)
par(mfg=c(1,2))
  ternaryplot(Ysim.com,pch=1)

par(opar)

We perform the compositional canonical analysis:

out.lrcco <- lrcco(Xsim.com,Ysim.com)

And we reproduce the results in Table 1 of Graffelman et al. (2018). The canonical correlations are obtained as

round(diag(out.lrcco$ccor),digits=3)
#> [1] 0.944 0.129 0.000

The canonical weights of the X set and the Y set are obtained by:

out.lrcco$A
#>               [,1]      [,2]          [,3]
#> [1,]  0.0008130933  3.847198 -1.110223e-15
#> [2,] -0.7985815849 -3.446655  6.522560e-16
#> [3,]  0.7977684917 -0.400543  1.665335e-16
out.lrcco$B
#>            [,1]        [,2]          [,3]
#> [1,]  0.7624647 -0.05038131 -9.714451e-17
#> [2,] -0.7165761 -0.52116661  3.608225e-16
#> [3,] -0.0458886  0.57154792 -2.775558e-16

The canonical loadings of the X set and the Y set are obtained by

out.lrcco$Rxu
#>          [,1]       [,2]       [,3]
#> X1 -0.8857398  0.4641822 -0.9035794
#> X2 -0.9828511 -0.1844012 -0.4438392
#> X3  0.9940477 -0.1089461  0.6849272
out.lrcco$Ryv
#>          [,1]       [,2]       [,3]
#> Y1  0.8522677 -0.5231058  0.2439545
#> Y2 -0.6097840 -0.7925676  0.9387752
#> Y3 -0.3033098  0.9528920 -0.8183778

The adequacy coefficients of the X set and the Y set:

out.lrcco$fitXs
#>            [,1]       [,2]      [,3]
#> AdeX  0.9128873 0.08711271 0.4941914
#> cAdeX 0.9128873 1.00000000 1.4941914
out.lrcco$fitYs
#>            [,1]      [,2]      [,3]
#> AdeY  0.3967312 0.6032688 0.5368516
#> cAdeY 0.3967312 1.0000000 1.5368516

The redundancy coefficients of the X set and the Y set

out.lrcco$fitXp
#>            [,1]        [,2]       [,3]
#> RedX  0.8132984 0.001442577 0.06638809
#> cRedX 0.8132984 0.814740980 0.88112907
out.lrcco$fitYp
#>            [,1]        [,2]      [,3]
#> RedY  0.3534509 0.009990066 0.1440308
#> cRedY 0.3534509 0.363441013 0.5074718

Finally, we make the full set of biplots for LR-CCO given in Figure 2 (Graffelman et al. (2018)). In each biplot, the canonical variates are multiplied by a convenient scalar to facilitate the visualization.

opar <- par(mfrow=c(2,2),mar=c(2,2,1,0)+0.5,mgp=c(2,1,0))
par(mfg=c(1,1))
#
# Figure A
#
Z <- rbind(out.lrcco$Fs,out.lrcco$Gp)
plot(Z[,1],Z[,2],type="n",xlim=c(-1,1),ylim=c(-1,1),asp=1,xlab="",ylab="",main="A")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(out.lrcco$Fs[,1],out.lrcco$Fs[,2],
     c(expression(X[1]),expression(X[2]),expression(X[3])))
text(out.lrcco$Gp[,1],out.lrcco$Gp[,2],
     c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.15
points(fa*out.lrcco$U[,1],fa*out.lrcco$U[,2])

par(mfg=c(1,2))
#
# Figure B
#
Z <- rbind(out.lrcco$Fp,out.lrcco$Gs)
plot(Z[,1],Z[,2],type="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),asp=1,xlab="",ylab="",main="B")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(out.lrcco$Fp[,1],out.lrcco$Fp[,2],
     c(expression(X[1]),expression(X[2]),expression(X[3])))
text(out.lrcco$Gs[,1],out.lrcco$Gs[,2],
     c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.25
points(fa*out.lrcco$V[,1],fa*out.lrcco$V[,2])

#
# Standardizing the clr transformed data
#
Xstan.clr <- scale(clrmat(Xsim.com))
Ystan.clr <- scale(clrmat(Ysim.com))
res.stan.cco <- canocov(Xstan.clr,Ystan.clr)
par(mfg=c(2,1))
#
# Figure C
#
Z <- rbind(res.stan.cco$Fs,res.stan.cco$Gp)
plot(Z[,1],Z[,2],type="n",xlim=c(-1,1),ylim=c(-1,1),asp=1,xlab="",ylab="",main="C")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(res.stan.cco$Fs[,1],res.stan.cco$Fs[,2],
     c(expression(X[1]),expression(X[2]),expression(X[3])))
text(res.stan.cco$Gp[,1],res.stan.cco$Gp[,2],
     c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.2
points(fa*res.stan.cco$U[,1],fa*res.stan.cco$U[,2])
circle()
par(mfg=c(2,2))
#
# Figure D
#
Z <- rbind(res.stan.cco$Fp,res.stan.cco$Gs)
plot(Z[,1],Z[,2],type="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),asp=1,xlab="",ylab="",main="D")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(res.stan.cco$Fp[,1],res.stan.cco$Fp[,2],
     c(expression(X[1]),expression(X[2]),expression(X[3])))
text(res.stan.cco$Gs[,1],res.stan.cco$Gs[,2],
     c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.25
points(fa*res.stan.cco$V[,1],fa*res.stan.cco$V[,2])
circle()

par(opar)

Panel A shows the logratios \(\log{(x_2/x_3)}\) and \(\log{(y_1/y_2)}\) to have long links that run parallel to the first canonical variate with the largest canonical correlation; these logratios are highly correlated. The canonical biplot shows the association between the two sets of compositions, which is not visible in the ternary diagrams above.

3.2 Canonical analysis of bentonites

In this subsection we treat the canonical analysis of bentonites. The X set concerns the concentrations of 9 major oxides, measured in 14 samples in the US (Cadrin et al. (1996)). A canonical analysis of this data set has been previously described (Reyment and Savazzi (1999)), and is extended here with biplots. The Y set concerns two isotopes, \(\delta D\) and \(\delta 18O\).

data("bentonites")
head(bentonites)
#>      Si    Al   Fe    Mn   Mg   Ca    K   Na   H20  dD d18O
#> 1 51.17 19.18 2.09 0.001 4.54 1.30 2.30 0.93  9.88  93 13.5
#> 2 50.66 19.01 1.67 0.001 2.70 1.70 0.39 0.67  9.99  92 21.9
#> 3 54.38 20.03 2.04 0.001 3.54 2.04 0.10 0.20 10.26  93 21.9
#> 4 55.58 18.76 0.56 0.001 4.51 2.00 0.31 0.90  9.54 100 24.6
#> 5 54.43 22.58 0.69 0.001 3.75 1.47 0.75 0.15  9.14 111 21.7
#> 6 62.79 20.75 1.14 0.060 4.26 0.14 0.42 0.40  8.28 114 24.1

We clr-transform and column-center the major oxides, after deletion of MnO which is outlying and had many zeros, which were replaced with 0.001. We standardize the isotopes.

X <- bentonites[,1:9]
X <- X[,-4]
X <- X/rowSums(X)
Y <- scale(bentonites[,10:11])
Xclr <- clrmat(X)
cco <- canocov(Xclr,Y)

The two canonical correlations are large:

diag(cco$ccor)
#> [1] 0.9656383 0.8244852

We construct a biplot of the data:

plot(cco$Fs[,1],cco$Fs[,2],col="red",pch=NA,xlab="First principal axis",
     ylab="Second principal axis",xlim=c(-1,1),ylim=c(-1,1),asp=1)
textxy(cco$Fs[,1],cco$Fs[,2],colnames(X),cex=0.75)
arrows(0,0,cco$Gp[,1],cco$Gp[,2],angle=10,col="blue",length=0.1)
arrows(0,0,cco$Fs[,1],cco$Fs[,2],angle=10,col="red",length=0.1)
text(cco$Gp[,1],cco$Gp[,2],colnames(Y),pos=c(3,1))
fa <- 0.45
points(fa*cco$U[,1],fa*cco$U[,2])
textxy(fa*cco$U[,1],fa*cco$U[,2],1:14)

We overplot the biplot with the canonical X-variates, which allows one to inspect the original samples (Graffelman (2005)). For plotting, the canonical variate is scaled with a convenient scaling factor (here 0.45). This factor does not affect the interpretation of the biplot, but gives the samples a convenient spread.

The logratio \(\log{(Na/Mg)}\) (among others) almost coincides with the first canonical variate, which correlates with \(\delta 18O\). However, interpretation should proceed with care because of the small sample size.

lnNaCa <- log(X[,c("Na")]/X[,c("Mg")])
cor(Y[,c("d18O")],lnNaCa)
#> [1] -0.7034501

4. Logratio discriminant analysis (LR-LDA)

We use archeological data from the UK (Tubb, Parker, and Nickless (1980)) to illustrate LR-LDA. This dataset consists of measurements of nine oxides on 48 archeological samples from three regions in the UK. We first prepare the data:

data(Tubb)
head(Tubb)
#>   Sample Al2O3 Fe2O3  MgO  CaO Na2O  K2O TiO2   MnO   BaO site
#> 1    GA1  18.8  9.52 2.00 0.79 0.40 3.20 1.01 0.077 0.015    G
#> 2    GA2  16.9  7.33 1.65 0.84 0.40 3.05 0.99 0.067 0.018    G
#> 3    GA3  18.2  7.64 1.82 0.77 0.40 3.07 0.98 0.087 0.014    G
#> 4    GA4  17.4  7.48 1.71 1.01 0.40 3.16 0.03 0.084 0.017    G
#> 5    GA5  16.9  7.29 1.56 0.76 0.40 3.05 1.00 0.063 0.019    G
#> 6    GB1  17.8  7.24 1.83 0.92 0.43 3.12 0.93 0.061 0.019    G
site             <- factor(Tubb$site)
Oxides           <- as.matrix(Tubb[,2:10])
rownames(Oxides) <- Tubb$Sample
Oxides           <- Oxides/rowSums(Oxides)

Next, we carry out LR-LDA by passing the compositions in Oxides to the function lrlda. Internally, lrlda applies the clr transformation of the data.

out.lrlda <- lrlda(Oxides,site)

The group sizes are obtained with:

out.lrlda$gsizes
#>  G NF  W 
#> 22 10 16

The group mean vectors of the clr transformed compositions are given by:

out.lrlda$Mclr
#>   Group.1    Al2O3    Fe2O3       MgO         CaO       Na2O      K2O
#> 1       G 3.011200 2.187485 0.7873925  0.09020565 -0.9765462 1.315820
#> 2      NF 4.348294 1.899077 1.0256504 -2.12623244 -1.5607332 2.175829
#> 3       W 2.807714 2.115195 1.8167997 -1.29407136 -1.3695227 1.622022
#>          TiO2       MnO       BaO
#> 1 -0.03735507 -2.485644 -3.892557
#> 2  1.47214116 -4.560993 -2.673032
#> 3 -0.08569854 -1.777495 -3.834943

The scores of the linear discriminant function are obtained by:

head(out.lrlda$LD)
#>             LD1      LD2
#> GA1  1.09082644 4.920326
#> GA2  0.05755742 4.460343
#> GA3  0.67388420 4.715621
#> GA4  1.55665675 4.892781
#> GA5 -0.38244898 4.436551
#> GB1  0.03938665 4.029670

The confusion matrix for the training observations is:

out.lrlda$CM
#>      pred
#> group  G NF  W
#>    G  22  0  0
#>    NF  0 10  0
#>    W   0  0 16

Posterior probabilities for the classifications are obtained by

head(round(out.lrlda$prob.posterior,4))
#>     G NF W
#> GA1 1  0 0
#> GA2 1  0 0
#> GA3 1  0 0
#> GA4 1  0 0
#> GA5 1  0 0
#> GB1 1  0 0

We extract biplot coordinates for group centers, individual observations and variables, and construct the LDA biplot.

Fp <- out.lrlda$Fp
Gs <- out.lrlda$Gs
LD <- out.lrlda$LD

colvec <- rep(NA,nrow(Oxides))
colvec[site=="G"]  <- "red"
colvec[site=="NF"] <- "green"
colvec[site=="W"]  <- "blue"


lims <- jointlim(LD,Fp)
opar <- par(bty="n",xaxt="n",yaxt="n")   
plot(Fp[,1],Fp[,2],asp=1,pch=17,xlab="",ylab="",col=c("red","green","blue"),
     xlim=lims$xlim,ylim=lims$ylim,cex=1.25)
points(LD[,1],LD[,2],col=colvec)
origin()
arrows(0,0,10*Gs[,1],10*Gs[,2],angle = 10, length = 0.1)
textxy(10*Gs[,1],10*Gs[,2],colnames(Oxides))
par(opar)
legend("topleft",c("G","NF","W"),col=c("red","green","blue"),pch=1,cex=0.5)

The LR-LDA biplot shows perfect separation of the three UK regions and suggests that a single logratio like \(\log{(MgO/Al2O3)}\) (among other possibilities) is capable of discriminating the three regions. A boxplot of this logratio confirms this.

lrMgOAl2O2 <- Oxides[,c("MgO")]/Oxides[,c("Al2O3")]
boxplot(lrMgOAl2O2~site,col=c("red","green","blue"))