--- title: "Clustering Multiple Nonparametric Curves" author: "[Nora M. Villanueva](https://noramvillanueva.github.io) & [Marta Sestelo](https://sestelo.github.io)" date: "`r Sys.Date()`" description: > This vignette describes the usage of the main functions included in clustcurv package with two real data sets. output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Clustering Multiple Nonparametric Curves} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, tidy.opts = list( keep.blank.line = TRUE, width.cutoff = 150 ), options(width = 150,fig.width=12, fig.height=8) ) ``` > This vignette covers changes between versions 2.0.1 and 2.0.2. This document explains how to use [clustcurv](https://cran.r-project.org/package=clustcurv) R package for clustering multiple nonparametric curves, under the survival and regression framework. To this end, we illustrate the use of the package using some real data sets. In the case of the survival context, the algorithm to determine groups automatically is applied to Veterans' Administration Lung Cancer Data [survival](https://cran.r-project.org/package=survival) package. For the regression analysis, the clustcurv R package includes a data set called `data(barnacle5)` with measurements of rostro-carinal length and dry weight of barnacles collected from five sites of Galicia (northwest of Spain). # Clustering multiple survival curves We will use Veterans' Administration Lung Cancer Data `data(veteran)` to illustrate the package capabilities to build clusters of survival curves based on a covariate. This data set is available in survival package. In this study, a total of 137 males with advanced inoperable lung cancer were randomized either a standard or test chemotherapy. The primary endpoint for therapy comparison was time to death recorded for each patient. Other covariates were also recorded. One of them is the categorical variable `histological type of tumor` with four levels: squamous, small cell, adeno, and large cell ## Introduction After regular installation with `install.packages()`, then load the packages and the data set with ```{r} library(clustcurv) library(survival) head(veteran) ``` ## Algorithm for determing groups of survival curves ### Based on $K$-medians Clusters and estimates of the survival curves are obtained using the `ksurvcurves()` function or `survclustcurves()` function. The main difference between them is that `ksurvcurves()`, given a fixed value of $K$, allows determing the group for which each survival function belongs. In addition, `survclustcurves()` is able to determine automatically the number of groups. The functions will verify if data has been introduced correctly and will create `kcurves` and `clustcurves` objects, respectively. Both functions allow determining groups using the optimization algorithm $K$-means or $K$-medians (e.g. `algorithm = 'kmeans'`, or `algorithm = 'kmedians'`). The first three arguments must be introduced, where `time` is a vector with event-times, `status` for their corresponding indicator statuses, and `x` is the categorical covariate. One can be interesting to know the assignment of the survival curves to the group which they belong and the automatic selection of the number of groups. As we mentioned, it is possible by means of the `survclustcurves()`function. The following input command provides an example of the output using, as well, the $K$-medians algorithm (i.e. `algorithm = 'kmedians'`) ```{r} res <- survclustcurves(time = veteran$time, status = veteran$status, x = veteran$celltype, algorithm = 'kmedians', nboot = 100, seed=300716) ``` In the above function it is also included an argument for reducing executing time by means of parallelizing the testing procedure. This is `cluster = TRUE`. Related to this argument, the number of cores to be used in the parallelized procedure can be specified with the argument `ncores`. By default, `ncores = NULL`, so that the number is equal to the number of cores of the machine - 1. The following piece of code can be executed for obtaining a small summary of the fit ```{r} summary(res) ``` As can be seen, the `summary()` function, as well as the `print()` function, can be used to obtained some brief information about the output from `survclustcurves()`. The graphical representation of the fitted model can be easily obtained using the function `autoplot()`. The plot obtained, specifying the arguments `groups_by_color = FALSE` and `interactive = TRUE`, represents the estimated survival curves for each level of the factor nodes by means of the Kaplan-Meier estimator. As expected, the survival of patients can be influenced by the cellular type of the tumor. ```{r, out.width= '100%', fig.align = "center"} autoplot(res , groups_by_colour = FALSE, interactive = TRUE) ``` The assignment of the curves to the three groups can be observed in the following plot simply typing `groups_by_color = TRUE` ```{r,out.width= '100%', fig.align = "center"} autoplot(res , groups_by_colour = TRUE, interactive = TRUE) ``` ### Based on $K$-means Equivalently, the following piece of code shows the input commands and the results obtained with the `algorithm = 'kmeans'`. The number of groups and the assignments are equal as those ones obtained with the `algorithm = 'kmedians'`. ```{r} res2 <- survclustcurves(time = veteran$time, status = veteran$status, x = veteran$celltype, algorithm = 'kmeans', nboot = 100, seed=300716) ``` # Clustering multiple regression curves We will use barnacle’s growth data `data(barnacle5)` to illustrate the package capabilities to build clusters of regression curves based on a covariate. This data set (`barnacle5`) is available in the clustcurv package. A total of 5000 specimens were collected from five sites of the region’s Atlantic coastline and corresponds to the stretches of coast where this species is harvested: Punta do Mouro, Punta Lens, Punta de la Barca, Punta del Boy and Punta del Alba. Two biometric variables of each specimen were measured: `RC` (Rostro-carinal length, maximum distance across the capitulum between the ends of the rostral and carinal plates) and `DW` (Dry Weight). ```{r} data("barnacle5") head(barnacle5) ``` Here, the idea is to know the relation between `RC` and `DW` variables along the coast, i.e., to analyze if the barnacle’s growth is similar in all locations `F` or by contrast, if it is possible to detect geographical differentiation in growth. To do this, the `regclustcurves()` function will be used with the input variables `y`, `x`, `z`, by means of executing the following piece of code ```{r} fit.bar <- regclustcurves(y = barnacle5$DW, x = barnacle5$RC, z = barnacle5$F, nboot = 100, seed = 300716, algorithm = 'kmeans') ``` The output of this function can be observed with `print()` or `summary()` functions. Below, there is an example of this ```{r} print(fit.bar) summary(fit.bar) ``` Equivalent to the example with survival curves shown before, the results obtained above can be plotted using the `autoplot()` ```{r, out.width= '100%', fig.align = "center"} autoplot(fit.bar, groups_by_color = TRUE, interactive = TRUE) ``` > Never mind -> `install.packages('clustcurv')`