---
title: "TDAvec vignette"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{TDAvec-vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Persistence diagrams are a fundamental tool in topological data analysis (TDA) that summarize the multi-scale topological features (such as connected components, loops, and voids) of a dataset. They represent these features as points in a 2D plane, where each point corresponds to the birth and death of a feature across different scales.

Since persistence diagrams are multisets of points in a 2D plane, their unordered structure makes them challenging to use directly in machine learning models. By vectorizing them (i.e., transforming into fixed-length vectors), conventional statistical and machine learning techniques can be applied while preserving topological information. The `TDAvec` R package is specifically designed for this task, offering 13 vectorization methods commonly used in TDA. These methods are divided into three broad categories:

  - Functional vector summaries - based on summary functions:
    - Betti curve 
    - Euler characteristic curve
    - Normalized life curve 
    - Persistence block
    - Persistence surface
    - Persistence landscape function
    - Persistence silhouette function
    - Persistent entropy summary function
    - Template function

  - Algebraic vector summaries - based on polynomial maps:
    - Algebraic functions
    - Complex polynomial coefficients
    - Tropical coordinate function
     
  - Statistical vector summaries - based on descriptive statistics:
    - Basic descriptive statistics

For improved computational efficiency, all code behind the vector summaries of `TDAvec` is written in `C++` using the `Rcpp` and `RcppArmadillo` packages. In this vignette, we illustrate the basic usage of the package using simple examples.   

Let's first load the required libraries. 

```{r setup}
library(TDAvec)
library(TDAstats) # to compute persistence diagrams
```

Now, we generate a 2D point cloud of size 100 sampled uniformly from a unit circle with added Gaussian noise:

```{r}
# the number of points to sample
N <- 100 
# set a random seed for reproducibility
set.seed(123) 
# sample N points uniformly from the unit circle and add Gaussian noise
theta <- runif(N, min = 0, max = 2 * pi)
X <- cbind(cos(theta), sin(theta)) + rnorm(2 * N, mean = 0, sd = 0.2)
# plot the point cloud
plot(X,pch = 20,asp = 1,xlab = 'x',ylab = 'y')
```

Next, we use the `TDAstats` package to compute the persistence diagram (PD) of a Vietoris-Rips filtration built on top of the point cloud $X$.

```{r}
D <- calculate_homology(X,dim=1,threshold=2)
sum(D[,1]==0) # number of connected components
sum(D[,1]==1) # number of loops
sum(D[,1]==2) # number of voids
```

In the `calculate_homology()` function, `dim` is the maximum homological dimension of the topological features to be computed (connected components if `dim=0`; connected components and loops if `dim=1`; connected components, loops and voids if `dim=2`, etc.). `threshold` is the maximum value of the scale parameter of the filtration (which we set equal to 2 since the points are sampled from a circle with diameter 2). 

The persistence diagram $D$ has `r sum(D[,1]==0)` connected components (the point cloud size - 1; `TDAstats` drops the connected component with infinite death value), `r sum(D[,1]==1)` loops (one with long life-span, the rest are short-lived) and `r sum(D[,1]==2)` voids along with their `birth` and `death` values. To plot the diagram, we can use the `plot_persisit()` function. 

```{r}
plot_persist(D)
```

In the plot, the solid dots and triangles represent connected components and loops respectively. 

Let's compute a vector summary of one of the simplest summary functions, the Betti curve, for homological dimension $H_1$.

```{r}
# sequence of scale values to be used for vectorization
scaleSeq = seq(0,1.5,length.out=16) 
# vectorize the Betti curve for homological dimension H_1
computeBettiCurve(D,homDim=1,scaleSeq)
```

By default, vectorization in this case is performed by computing the average values of the Betti curve over consecutive intervals defined by an increasing sequence of scale points (`evaluate`="intervals"). This vectorization method is adopted as the default for all univariate summary functions that are easy to integrate. More specifically, if $f$ is a (univariate) summary function and $t_1,t_2,\ldots,t_n$ are increasing scale values, we discretize $f$ as:

\begin{equation}
\Big(\frac{1}{\Delta t_1}\int_{t_1}^{t_2}f(t)dt,\frac{1}{\Delta t_2}\int_{t_2}^{t_3}f(t)dt,\ldots,\frac{1}{\Delta t_{n-1}}\int_{t_{n-1}}^{t_n}f(t)dt\Big)\in\mathbb{R}^{n-1},
\end{equation}

where $\Delta t_k=t_{k+1}-t_k$, $k=1,\ldots,n-1$. 

Alternatively, by setting the `evaluate` argument to "points", one can vectorize the Betti curve and all the other univariate summary functions by evaluating them at each scale point and arranging the values into a vector:

\begin{equation}
(f(t_1),f(t_2),\ldots,f(t_n))\in\mathbb{R}^{n},
\end{equation}

As a note, the `computePersistenceLandscape()` function supports both classic and generalized persistence landscape functions. For example:

```{r}
# classic
computePersistenceLandscape(D,homDim=1,scaleSeq,k=3) # k = the number of persistence landscape functions to consider (default is 1)

# generalized
computePersistenceLandscape(D,homDim=1,scaleSeq,k=3,generalized=TRUE,kernel = "epanechnikov",h=0.2) # h = bandwidth for the kernel function
```

Computing algebraic and statistical vector summaries follows a similar approach. For example:

```{r}
computeAlgebraicFunctions(D,homDim=0)  
```

returns four algebraic functions based on the birth and death values (refer to the help documentation for more details).

_Persistence surface_ and _persistence block_ are bivariate summary functions and to vectorize them, we first need to switch from the birth-death to the birth-persistence coordinates: 

```{r}
D[,3] <- D[,3] - D[,2] 
colnames(D)[3] <- "Persistence"
```

Below, we compute the _persistence image_, which is a vector summary of the _persistence surface_:

```{r}
# Persistence Image (PI)
resB <- 5 # resolution (or grid size) along the birth axis
resP <- 5 # resolution (or grid size) along the persistence axis 
# find min and max persistence values for dimension H_0
minPH0 <- min(D[D[,1]==0,3]); maxPH0 <- max(D[D[,1]==0,3])
# construct one-dimensional grid of scale values
ySeqH0 <- seq(minPH0,maxPH0,length.out=resP+1)
# default way of selecting the standard deviation sigma of the Gaussians on top of each point of the diagram
sigma <- 0.5*(maxPH0-minPH0)/resP 
# compute PI for homological dimension H_0
computePersistenceImage(D,homDim=0,xSeq=NA,ySeqH0,sigma)
```

Since the $H_0$ features all have the birth value of zero in this case, a one-dimensional grid of scale values is used for vectorization.  

For homological dimension $H_1$, the birth values are not all the same and therefore the vectorization is performed over a two-dimensional grid. 

```{r}
# Persistence Image (PI)
# find min & max birth and persistence values for dimension H_1
minBH1 <- min(D[D[,1]==1,2]); maxBH1 <- max(D[D[,1]==1,2])
minPH1 <- min(D[D[,1]==1,3]); maxPH1 <- max(D[D[,1]==1,3])
xSeqH1 <- seq(minBH1,maxBH1,length.out=resB+1)
ySeqH1 <- seq(minPH1,maxPH1,length.out=resP+1)
sigma <- 0.5*(maxPH1-minPH1)/resP
# compute PI for homological dimension H_1
computePersistenceImage(D,homDim=1,xSeqH1,ySeqH1,sigma) 
```
The code for `computePersistenceImage()` is adopted from the `pers.image()` function (available in the `kernelTDA` package) with some modifications. For example, `pers.image()` uses a one-dimensional grid for homological dimension $H_0$ regardless of the filtration type. In contrast, `computePersistenceImage()` uses a one-dimensional grid only if additionally the birth values are the same (which may not be true for some filtration types). Moreover, `pers.image()` uses a square grid (e.g., 10 by 10) for vectorization, whereas `computePersistenceImage()` is not limited to such a grid and can compute vector summaries using a rectangular grid (e.g., 10 by 20). 

#### References

1. Chazal, F., & Michel, B. (2021). An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists. _Frontiers in Artificial Intelligence_, 108.

2. Chung, Y. M., & Lawson, A. (2022). Persistence curves: A canonical framework for summarizing persistence diagrams. _Advances in Computational Mathematics_, 48(1), 1-42.

3. Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. _Journal of Machine Learning Research_, 16(1), 77-102. 

4. Berry, E., Chen, Y. C., Cisewski-Kehe, J., & Fasy, B. T. (2020). Functional summaries of persistence diagrams. _Journal of Applied and Computational Topology_, 4(2):211–262.

5. Adcock, A., Carlsson, E. and Carlsson, G., 2013. The ring of algebraic functions on persistence bar codes. Homology, Homotopy Appl., 18:381–402, 2016.

6. Ali, D., Asaad, A., Jimenez, M.J., Nanda, V., Paluzo-Hidalgo, E. and Soriano-Trigueros, M., (2023). A survey of vectorization methods in topological data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.

7. Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., ... & Ziegelmeier, L. (2017). Persistence images: A stable vector representation of persistent homology. _Journal of Machine Learning Research_, 18.