---
title: "Auto K-Means with healthyR.ai"
subtitle: "K-Means Series"
author: "Steven P. Sanderson II, MPH"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Auto K-Means with healthyR.ai}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}

---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 4.5,
  fig.align = 'center',
  out.width = '95%',
  dpi = 100
)
```

```{r setup}
library(healthyR.ai)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(h2o))
```

# Data
Many projects call for some sort of clustering on a given set of data, and
there are many ways to accomplish this. This vignette shows how to take a
prepared data set, such as the built-in `iris` data, and process it with the
`healthyR.ai` function `hai_kmeans_automl()`.

First, let's take a look at the data itself.

```{r iris_data}
df_tbl <- iris

glimpse(df_tbl)
```

From here we can see that the data is already prepared and ready to go. There is
a factor column, `Species`, that labels each row, and the remaining columns are
already numeric. The rest is fairly straightforward. Let's use the
`hai_kmeans_automl()` function to create its list output. We set the `Species`
column aside as the target and use the remaining feature columns as the
predictors.

# Use the function

```{r automl_kmeans}
# Split the column names into the label column and the feature columns
column_names <- names(df_tbl)
target_col <- "Species"
predictor_cols <- setdiff(column_names, target_col)
```
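
The text above notes that the feature columns are already numeric. One way to
confirm this before modeling is sketched below. The check is an illustrative
addition in base R, not a step that `hai_kmeans_automl()` is known to require,
and the name `is_numeric_col` is introduced here for the example only.

```{r check_predictors}
# Same column split as above, repeated so this chunk runs on its own.
predictor_cols <- setdiff(names(iris), "Species")

# One TRUE/FALSE value per predictor column: is the column numeric?
is_numeric_col <- vapply(iris[predictor_cols], is.numeric, logical(1))
is_numeric_col

# Stop early if any predictor column is not numeric.
stopifnot(all(is_numeric_col))
```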

Now we have our column inputs for the function, so we can go ahead and run it.

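```{r run_auto_ml, eval=FALSE}
# Start (or connect to) a local h2o instance
h2o.init()

# Run the automated K-Means on the feature columns without standardizing them
output <- hai_kmeans_automl(
  .data = df_tbl,
  .predictors = predictor_cols,
  .standardize = FALSE
)

# Shut the h2o instance down without prompting for confirmation
h2o.shutdown(prompt = FALSE)
```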

The function returns a list with several components. The sections below discuss
what comes out of it.

# Function Output

Let's take a look at the structure of the output object. It is a list of lists with
four main components:

-  data
-  auto_kmeans_obj
-  model_id (h2o model id)
-  scree_plt (a `ggplot2` object)
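
For a quick overview of a list like this, a small helper can tabulate each
top-level component. The sketch below is hypothetical: `summarise_components()`
is not part of `healthyR.ai`, and `stand_in` copies only the four component
names listed above, with placeholder values rather than real
`hai_kmeans_automl()` output.

```{r inspect_components}
# Hypothetical helper, not part of healthyR.ai: tabulate the name, first
# class, and length of each top-level element of a list.
summarise_components <- function(x) {
  data.frame(
    component = names(x),
    class = unname(vapply(x, function(el) class(el)[1], character(1))),
    n_elements = unname(lengths(x)),
    stringsAsFactors = FALSE
  )
}

# Stand-in list that reuses only the component names from the list above.
# The values are placeholders, not real function output.
stand_in <- list(
  data = list(),
  auto_kmeans_obj = NA,
  model_id = "placeholder_id",
  scree_plt = NA
)

summarise_components(stand_in)
```

With a real result, `summarise_components(output)` would be the call to make.
Base R's `str(output, max.level = 1)` gives a similar overview without a helper.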

Let's explore each of these items.

## Data

The data list contains several sections, each of which can be viewed and
accessed directly. All of the outputs are labeled in an easy-to-understand
manner.

```{r data_section, eval=FALSE}
output$data
```

## Auto-ML Object

Now for the auto-ml object itself.

```{r autom_obj, eval=FALSE}
output$auto_kmeans_obj
```

## The Best Model

The output also holds the best model that was saved off, referenced by its h2o model id.

```{r best_model, eval=FALSE}
output$model_id
```

## Scree Plot

A `ggplot2` scree plot is also generated. It helps us judge how many clusters
are in the data, based on minimizing the within-cluster sum of squared errors.

```{r scree_plot, eval=FALSE}
print(output$scree_plt)
```
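
To make the idea behind a scree plot concrete, here is a minimal sketch that
builds one by hand. It rests on a loud assumption: base R's `stats::kmeans()`
stands in for the h2o model, so the values will not match `output$scree_plt`.
The names `features`, `k_values`, and `scree_tbl` are introduced for this
example only.

```{r manual_scree}
library(ggplot2)

# Assumption: stats::kmeans() is used here in place of the h2o K-Means model.
set.seed(123)
features <- iris[, setdiff(names(iris), "Species")]

# Fit K-Means for k = 1..10 and keep the total within-cluster sum of squares.
k_values <- 1:10
tot_withinss <- vapply(
  k_values,
  function(k) kmeans(features, centers = k, nstart = 25, iter.max = 50)$tot.withinss,
  numeric(1)
)

scree_tbl <- data.frame(k = k_values, tot_withinss = tot_withinss)

# Plot the within-cluster sum of squares against the number of clusters.
ggplot(scree_tbl, aes(x = k, y = tot_withinss)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = k_values) +
  labs(
    x = "Number of clusters (k)",
    y = "Total within-cluster sum of squares"
  )
```

The usual reading is to look for the "elbow", the value of `k` after which
adding more clusters yields only small reductions in the within-cluster sum of
squares.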