---
title: "Multivariate tools for compositional data analysis: the ToolsForCoDA package"
author:
- Jan Graffelman - Dpt. of Statistics, Universitat Politecnica de Catalunya; Dpt. of Biostatistics, University of Washington
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: ToolsForCoDa.bib
vignette: >
%\VignetteIndexEntry{Multivariate tools for compositional data analysis: the ToolsForCoDA package}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
knitr::opts_chunk$set(fig.align = "center")
knitr::opts_chunk$set(warnings = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(fig.width = 6, fig.height = 6)
rm(list=ls())
```
## Introduction
The **ToolsForCoDa** package was originally created in order to provide functions for a canonical correlation analysis with compositional data (@Graffelman2018), based on the centred logratio (clr) transformation of the compositions. Posteriorly, it has been extended with additional tools for the multivariate analysis of compositional data in the R environment. Currently, this package (version 1.0.9) provides functionality for
* log-ratio principal component analysis (LR-PCA).
* log-ratio canonical correlation analysis (LR-CCO).
* log-ratio discriminant analysis (LR-LDA).
Both CCO and LDA rely on the inversion of a covariance matrix. The covariance matrix of the clr transformed compostions is structurally singular. The programs `lrcco` and `lrlda` resolve this with the use of a generalized inverse. Functionality for the analysis of compositional data in the R environment can be found in the packages **compositions** (@compositions), **robCompositions** (@robCompositions), **easyCODA** (@easyCODA) and **zCompositions** (@zCompositions). For further reading on compositional data, see the seminal text on compositional data by Aitchison (@Aitchison) and several recent statistical textbooks (@Pawlowsky, @Filzmoser, @Greenacre, @VanDenBoogaart).
The remainder of this vignette shows an example session showing how to perform the three aforementioned types of analysis.
1. [Installation](#installation)
2. [LR-PCA](#lrpca)
3. [LR-CCO](#lrpco)
4. [LR-LDA](#lrlda)
## 1. Installation
```{r preinstall}
#install.packages("ToolsForCoDa")
library(calibrate)
library(Correlplot)
library(ToolsForCoDa)
```
## 2. Logratio principal component analysis (LR-PCA)
We consider the composition of 37 Pinot Noir samples, consisting of the concentrations of Cd, Mo, Mn, Ni, Cu, Al, Ba, Cr, Sr, Pb, B, Mg, Si, Na, Ca, P, K and an evaluation of the wine's aroma. (@FrankKowalski).
```{r pinotnoir}
data(PinotNoir)
head(PinotNoir)
Aroma <- PinotNoir[,c("Aroma")]
```
We apply closure to the chemical concentrations by division by their total, and
use `lrpca` to do perform LR-PCA.
```{r closure}
X <- as.matrix(PinotNoir[,1:17])
Xc <- X/rowSums(X)
out.lrpca <- lrpca(Xc)
```
We study the decomposition of compositional variance, and the decay of the LR-PCA eigenvalues by means of a screeplot
```{r decomposition}
round(out.lrpca$decom,2)
plot(1:ncol(out.lrpca$decom),
out.lrpca$decom[1,],type="b",main="Scree-plot",
xlab="PC",ylab="Eigenvalue")
```
We construct a covariance biplot, using `jointlim` to establish sensible limits for the x and y axes. Column markers for the clr transformed variables are multiplied by a constant (2.5) for a better visualization, and the amount of explained variance is indicated on the coordinate axes.
```{r biplot}
lims <- jointlim(out.lrpca$Fs,2.5*out.lrpca$Gp)
bplot(out.lrpca$Fs,2.5*out.lrpca$Gp,rowlab="",collab=colnames(Xc),rowch=1,colch=NA,
xl=lims$xlim,yl=lims$ylim,main="Covariance biplot")
pc1lab <- paste("PC1 (",toString(round(100*out.lrpca$decom[2,1],1)),"%)",sep="")
pc2lab <- paste("PC2 (",toString(round(100*out.lrpca$decom[2,2],1)),"%)",sep="")
text(1,-0.1,pc1lab,cex=0.75)
text(0.1,1.5,pc2lab,cex=0.75,srt=90)
```
This biplot reveals that the logratio $\ln{(Na/Pb)}$ has a large variance and is tightly correlated to the first principal component.
The variable `Aroma` correlates with the first principal components
```{r aroma}
cor(Aroma,out.lrpca$Fs[,1:2])
```
and as the biplot suggests, `Aroma` correlates positively with the logratio $\ln{(Cr/Sr)}$
```{r correlation}
logScSr <- log(Xc[,c("Cr")]/Xc[,c("Sr")])
cor(Aroma,logScSr)
```
We note function `lrpca` also calculates condition indices, which may prove useful for detecting proportionality or one-dimensional relationships (@Graffelman2021).
## 3. Logratio canonical correlation analysys (LR-CCO)
Two examples of LR-CCO are given below. The first example concerns a small artificial data set, where both the X and Y set are compositional, and is described in Section 3.1 of Graffelman et al. (2018). The second example concerns major oxides compositions of bentonites, where the X set is compositional and Y set is not.
### 3.1 Artificial data
We first load two artificial 3-part compositions.
```{r artificial}
data("Artificial")
Xsim.com <- Artificial$Xsim.com
Ysim.com <- Artificial$Ysim.com
colnames(Xsim.com) <- paste("X",1:3,sep="")
colnames(Ysim.com) <- paste("Y",1:3,sep="")
```
We make the ternary diagrams of the two sets of compositions
```{r ternaries, fig.width = 6, fig.height = 3}
opar <- par(mfrow=c(1,2),mar=c(2,2,1,0)+0.5,pty="s")
par(mfg=c(1,1))
ternaryplot(Xsim.com,pch=1)
par(mfg=c(1,2))
ternaryplot(Ysim.com,pch=1)
par(opar)
```
We perform the compositional canonical analysis:
```{r lrcco}
out.lrcco <- lrcco(Xsim.com,Ysim.com)
```
And we reproduce the results in Table 1 of Graffelman et al. (2018). The canonical correlations are obtained as
```{r cancortab}
round(diag(out.lrcco$ccor),digits=3)
```
The canonical weights of the X set and the Y set are obtained by:
```{r canweights}
out.lrcco$A
out.lrcco$B
```
The canonical loadings of the X set and the Y set are obtained by
```{r canloadings}
out.lrcco$Rxu
out.lrcco$Ryv
```
The adequacy coefficients of the X set and the Y set:
```{r canadequacy}
out.lrcco$fitXs
out.lrcco$fitYs
```
The redundancy coefficients of the X set and the Y set
```{r canredundancey}
out.lrcco$fitXp
out.lrcco$fitYp
```
Finally, we make the full set of biplots for LR-CCO given in Figure 2 (@Graffelman2018). In each biplot, the canonical variates are multiplied by a convenient scalar to facilitate
the visualization.
```{r panelbiplots}
opar <- par(mfrow=c(2,2),mar=c(2,2,1,0)+0.5,mgp=c(2,1,0))
par(mfg=c(1,1))
#
# Figure A
#
Z <- rbind(out.lrcco$Fs,out.lrcco$Gp)
plot(Z[,1],Z[,2],type="n",xlim=c(-1,1),ylim=c(-1,1),asp=1,xlab="",ylab="",main="A")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(out.lrcco$Fs[,1],out.lrcco$Fs[,2],
c(expression(X[1]),expression(X[2]),expression(X[3])))
text(out.lrcco$Gp[,1],out.lrcco$Gp[,2],
c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.15
points(fa*out.lrcco$U[,1],fa*out.lrcco$U[,2])
par(mfg=c(1,2))
#
# Figure B
#
Z <- rbind(out.lrcco$Fp,out.lrcco$Gs)
plot(Z[,1],Z[,2],type="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),asp=1,xlab="",ylab="",main="B")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(out.lrcco$Fp[,1],out.lrcco$Fp[,2],
c(expression(X[1]),expression(X[2]),expression(X[3])))
text(out.lrcco$Gs[,1],out.lrcco$Gs[,2],
c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.25
points(fa*out.lrcco$V[,1],fa*out.lrcco$V[,2])
#
# Standardizing the clr transformed data
#
Xstan.clr <- scale(clrmat(Xsim.com))
Ystan.clr <- scale(clrmat(Ysim.com))
res.stan.cco <- canocov(Xstan.clr,Ystan.clr)
par(mfg=c(2,1))
#
# Figure C
#
Z <- rbind(res.stan.cco$Fs,res.stan.cco$Gp)
plot(Z[,1],Z[,2],type="n",xlim=c(-1,1),ylim=c(-1,1),asp=1,xlab="",ylab="",main="C")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(res.stan.cco$Fs[,1],res.stan.cco$Fs[,2],
c(expression(X[1]),expression(X[2]),expression(X[3])))
text(res.stan.cco$Gp[,1],res.stan.cco$Gp[,2],
c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.2
points(fa*res.stan.cco$U[,1],fa*res.stan.cco$U[,2])
circle()
par(mfg=c(2,2))
#
# Figure D
#
Z <- rbind(res.stan.cco$Fp,res.stan.cco$Gs)
plot(Z[,1],Z[,2],type="n",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5),asp=1,xlab="",ylab="",main="D")
arrows(0,0,Z[1:3,1],Z[1:3,2],col="red",angle=10,length=0.1)
arrows(0,0,Z[4:6,1],Z[4:6,2],col="blue",angle=10,length=0.1)
text(res.stan.cco$Fp[,1],res.stan.cco$Fp[,2],
c(expression(X[1]),expression(X[2]),expression(X[3])))
text(res.stan.cco$Gs[,1],res.stan.cco$Gs[,2],
c(expression(Y[1]),expression(Y[2]),expression(Y[3])),pos=c(4,3,1))
grid()
fa <- 0.25
points(fa*res.stan.cco$V[,1],fa*res.stan.cco$V[,2])
circle()
par(opar)
```
Panel A shows the logratios $\log{(x_2/x_3)}$ and $\log{(y_1/y_2)}$ to have long links that run parallel to the first canonical variate with the largest canonical correlation; these logratios are highly correlated. The canonical biplot shows the association between the two sets of compositions, which is not visible in the ternary diagrams above.
### 3.2 Canonical analysis of bentonites
In this subsection we treat the canonical analysis of bentonites. The X set concerns the concentrations of 9 major oxides, measured in 14 samples in the US (@Cadrin). A canonical analysis of this data set has been previously described (@Reyment), and is extended here with biplots. The Y set concerns two isotopes, $\delta D$ and $\delta 18O$.
```{r bentonites}
data("bentonites")
head(bentonites)
```
We clr-transform and column-center the major oxides, after deletion of MnO which is outlying and had many zeros, which were replaced with 0.001. We standardize the isotopes.
```{r clrbentonites}
X <- bentonites[,1:9]
X <- X[,-4]
X <- X/rowSums(X)
Y <- scale(bentonites[,10:11])
Xclr <- clrmat(X)
cco <- canocov(Xclr,Y)
```
The two canonical correlations are large:
```{r twocancor}
diag(cco$ccor)
```
We construct a biplot of the data:
```{r biplotbentonites}
plot(cco$Fs[,1],cco$Fs[,2],col="red",pch=NA,xlab="First principal axis",
ylab="Second principal axis",xlim=c(-1,1),ylim=c(-1,1),asp=1)
textxy(cco$Fs[,1],cco$Fs[,2],colnames(X),cex=0.75)
arrows(0,0,cco$Gp[,1],cco$Gp[,2],angle=10,col="blue",length=0.1)
arrows(0,0,cco$Fs[,1],cco$Fs[,2],angle=10,col="red",length=0.1)
text(cco$Gp[,1],cco$Gp[,2],colnames(Y),pos=c(3,1))
fa <- 0.45
points(fa*cco$U[,1],fa*cco$U[,2])
textxy(fa*cco$U[,1],fa*cco$U[,2],1:14)
```
We overplot the biplot with the canonical X-variates, which allows one to inspect the original samples (@Graffelman2005). For plotting, the canonical variate is scaled with a convenient scaling factor (here 0.45). This factor does not affect the interpretation of the biplot, but gives the samples a convenient spread.
The logratio $\log{(Na/Mg)}$ (among others) almost coincides with the first canonical variate, which correlates with $\delta 18O$. However, interpretation should proceed with care because of the small sample size.
```{r lognaca}
lnNaCa <- log(X[,c("Na")]/X[,c("Mg")])
cor(Y[,c("d18O")],lnNaCa)
```
## 4. Logratio discriminant analysis (LR-LDA)
We use archeological data from the UK (@Tubb) to illustrate LR-LDA. This dataset consists of measurements of nine oxides on 48 archeological samples from three regions in the UK. We first prepare the data:
```{r tubbdata}
data(Tubb)
head(Tubb)
site <- factor(Tubb$site)
Oxides <- as.matrix(Tubb[,2:10])
rownames(Oxides) <- Tubb$Sample
Oxides <- Oxides/rowSums(Oxides)
```
Next, we carry out LR-LDA by passing the compositions in `Oxides` to the function `lrlda`. Internally, `lrlda` applies the clr transformation of the data.
```{r lrldatubbdata}
out.lrlda <- lrlda(Oxides,site)
```
The group sizes are obtained with:
```{r groupsizes}
out.lrlda$gsizes
```
The group mean vectors of the clr transformed compositions are given by:
```{r groupmeans}
out.lrlda$Mclr
```
The scores of the linear discriminant function are obtained by:
```{r ldscores}
head(out.lrlda$LD)
```
The confusion matrix for the training observations is:
```{r confusion}
out.lrlda$CM
```
Posterior probabilities for the classifications are obtained by
```{r posteriorprobs}
head(round(out.lrlda$prob.posterior,4))
```
We extract biplot coordinates for group centers, individual observations and variables, and construct the LDA biplot.
```{r biplotcoordinates}
Fp <- out.lrlda$Fp
Gs <- out.lrlda$Gs
LD <- out.lrlda$LD
colvec <- rep(NA,nrow(Oxides))
colvec[site=="G"] <- "red"
colvec[site=="NF"] <- "green"
colvec[site=="W"] <- "blue"
lims <- jointlim(LD,Fp)
opar <- par(bty="n",xaxt="n",yaxt="n")
plot(Fp[,1],Fp[,2],asp=1,pch=17,xlab="",ylab="",col=c("red","green","blue"),
xlim=lims$xlim,ylim=lims$ylim,cex=1.25)
points(LD[,1],LD[,2],col=colvec)
origin()
arrows(0,0,10*Gs[,1],10*Gs[,2],angle = 10, length = 0.1)
textxy(10*Gs[,1],10*Gs[,2],colnames(Oxides))
par(opar)
legend("topleft",c("G","NF","W"),col=c("red","green","blue"),pch=1,cex=0.5)
```
The LR-LDA biplot shows perfect separation of the three UK regions and suggests that a single logratio like $\log{(MgO/Al2O3)}$ (among other possibilities) is capable of discriminating the three regions. A boxplot of
this logratio confirms this.
```{r lrmgoal2o3}
lrMgOAl2O2 <- Oxides[,c("MgO")]/Oxides[,c("Al2O3")]
boxplot(lrMgOAl2O2~site,col=c("red","green","blue"))
```
## References