--- title: "Introduction to ggseqplot" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to ggseqplot} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} csl: apa.csl bibliography: references.bib editor_options: markdown: wrap: 72 --- ```{r prelude, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width=8, fig.height=4.94 ) old <- options() on.exit(options(old)) options(rmarkdown.html_vignette.check_title = FALSE) pkgs <- c("colorspace", "ggplot2", "ggthemes", "ggseqplot", "hrbrthemes", "patchwork", "purrr", "TraMineR") # Load all packages to library and adjust options lapply(pkgs, library, character.only = TRUE) ``` ## Prelude Following Fasang and Liao [-@fasang2014], we distinguish between sequence representation and summarization graphs. The latter aggregate and summarize the information stored in the sequence data without plotting actual observed sequences. Given the complexity of sequence data, these type of plots focus on one or two dimensions of information stored in sequence data [@brzinsky-fay2014]. Among the diverse members of the family of summarization graphs are sequence transitions plot, Kaplan-Meier survival curves, modal state plots, mean time plots, state distribution plots, and entropy plots [@fasang2014; @raab2022]. [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) includes five summarization graphs: - state distribution plots (`ggseqdplot`) - entropy line plots (`ggseqplot`) - modal state sequence plot (`ggseqmsplot`) - mean time plot (`ggseqmtplot`) - transition rate plots (`ggseqtrplot`) Whereas summarization graphs aggregate the sequence data, representation plots always display actually observed sequences. In the most basic form of the traditional sequence index plot all observed sequences are displayed. In data sets with several hundred cases this kind of visualization, however, cannot be reasonably applied because of the issue of overplotting. In such a scenario individual sequences are partly plotted on top of each other and the resulting graph would be an inaccurate representation of the underlying sequence data. In response to this issue alternative representation plots which only render a subset of the sequences have been suggested. [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) allows to render both the traditional sequence index plots and representation plots of subsets of sequences. Specifically, the library contains the following plot types: - sequence index plot (`ggseqiplot`) - sequence frequency plot (`ggseqfplot`) - representative sequence plot (`ggseqrplot`) - relative frequency sequence plot (`ggseqrfplot`) For a more detailed discussion of sequence visualization I recommend the following articles/book chapters: Brzinsky-Fay [-@brzinsky-fay2014], Fasang and Liao [-@fasang2014] and Chapter 2 of Raab & Struffolino [-@raab2022]. With the exception of the transition rate plot all of the plots listed above can be also produced with [`{TraMineR}`](http://traminer.unige.ch){target="_blank"}. In total, those two libraries provide a much more comprehensive set of plots and often allow for more options than {ggseqplot}. Hence, if you are an experienced user base R's `plot` (which is used to render the [`{TraMineR}`](http://traminer.unige.ch){target="_blank"} plots), there is no real need to make yourself acquainted with [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/). [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) was written because like many other R users I prefer [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} to base R's `plot` environment for visualizing data. [`{TraMineR}`](http://traminer.unige.ch){target="_blank"} [@gabadinho2011] was developed before [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} [@wickham2016] was as popular as it is today and back then many users were more familiar with coding base R plots. To date, however, many researchers and students are more accustomed to using [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} and prefer to draw on the related skills and experiences instead of learning how to refine base R plots just for the single purpose of visualizing sequence data. This vignette outlines how sequence data generated with `TraMineR::seqdef` are reshaped to plot them as ggplot2-typed figures using [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/). More specifically, it gives an overview of the general procedure and depicts which [`{TraMineR}`](http://traminer.unige.ch){target="_blank"} and [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} functions are used to render the plots The vignette further illustrates how the appearance of plots produced with [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) can be changed using [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} functions and extensions. ## Setup example We start by loading the required libraries and defining the sequence data to be plotted. We draw in the examples from the [`{TraMineR}`](http://traminer.unige.ch){target="_blank"} for setting up the examples. \
**Click to see code for installing and loading required packages** ```{r libraries, eval=FALSE} ## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ## Load and download (if necessary) required packages ---- ## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ## Save package names as a vector of strings pkgs <- c("colorspace", "ggplot2", "ggthemes", "hrbrthemes", "patchwork", "purrr", "TraMineR") ## Install uninstalled packages lapply(pkgs[!(pkgs %in% installed.packages())], install.packages, repos = getOption("repos")["CRAN"]) ## Load all packages to library and adjust options lapply(pkgs, library, character.only = TRUE) ## Don't forget to load ggseqplot library(ggseqplot) ```
```{r setup, message=FALSE} ## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ## Creating state sequence objects from example data sets ---- ## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ## biofam data data(biofam) biofam.lab <- c("Parent", "Left", "Married", "Left+Marr", "Child", "Left+Child", "Left+Marr+Child", "Divorced") biofam.seq <- seqdef(biofam[501:600, ], 10:25, # we only use a subsample labels = biofam.lab, weights = biofam$wp00tbgs[501:600]) ## actcal data data(actcal) actcal.lab <- c("> 37 hours", "19-36 hours", "1-18 hours", "no work") actcal.seq <- seqdef(actcal,13:24, labels=actcal.lab) ## ex1 data data(ex1) ex1.seq <- seqdef(ex1, 1:13, weights=ex1$weights) ``` Note that the default figure size in this document is specified as: `fig.width=8, fig.height=4.94` ## Technicalities In general, all [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) functions operate in the similar way: they extract the data to be plotted using a state sequence object generated with `TraMineR::seqdef` as a staring point. The functions either simply use (a subset of) the sequence data stored in this object or call other [`{TraMineR}`](http://traminer.unige.ch){target="_blank"} functions such as `TraMineR::seqstatd` to obtain the information to be plotted. Under the hood [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) reshapes those data to visualize them using [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} functions. Usually this means that the data have to be reshaped into a long (tidy) format. The following example illustrates the procedure for the case of a state distribution plot. The cross-sectional state distributions across the positions of the sequence data can be obtained by: ```{r seqstatd} seqstatd(actcal.seq) ``` When calling `ggseqdplot` these distributional data are reshaped into a long data set in which every row stores the (weighted) relative frequency of a given state at a given position along the sequence. The example data `actcal.seq` contain sequences of length 12 with an alphabet comprising 4 states. The reshaped data serving as source for the [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} call thus contain $12\times4=48$ rows. If a group vector is specified, the respective data will comprise 48 rows for each group. The data set produced by `ggseqdplot` can be accessed if the function's output is assigned to an object. The resulting list object stores the data as its first element (named `data`). ```{r dplot-data} dplot <- ggseqdplot(actcal.seq) dplot$data ``` Once the data are in the right shape [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) functions produce graphs using [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} functions. In the case of the state distribution plot, for instance, `ggseqdplot` renders stacked bar charts for each sequence position using `ggplot2::geom_bar`. The following table gives an overview of the most important internal function calls used to render different plot types with [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) | ggseqplot function | TraMineR function | ggplot2 geoms and extensions | |---------------|----------------|-----------------------------------------| | `ggseqdplot` | `TraMineR::seqstatd` | `ggplot2::geom_bar`
optional: `geom_line` | | `ggseqeplot` | `TraMineR::seqstatd` | `ggplot2::geom_line` | | `ggseqmsplot` | `TraMineR::seqmodst` | `ggplot2::geom_bar` | | `ggseqmtplot` | `TraMineR::seqmeant` | `ggplot2::geom_bar` | | `ggseqtrplot` | `TraMineR::seqtrate` | `ggplot2::geom_tile` | | `ggseqiplot` | `TraMineR::seqformat` | `ggplot2::geom_rect` | | `ggseqfplot` | `TraMineR::seqtab` | `ggplot2::geom_rect`
[`{ggh4x}`](https://teunbrand.github.io/ggh4x/){target="_blank"} (for the axis labeling if group has been specified) | | `ggseqrplot` | `TraMineR::seqrep` | `ggplot2::geom_rect`
`ggrepel::geom_text_repel`
[`{ggtext}`](https://wilkelab.org/ggtext/){target="_blank"} (for optional colored axis labels)
[`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"} (to combine plots) | | `ggseqrfplot` | `TraMineR::seqrfplot` | `ggplot2::geom_rect`
`ggplot2::geom_boxplot`
[`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"} (to combine plots) | The appearance of most of the plots generated with [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) can be adjusted just like every other ggplot (e.g., by changing the theme or the scale using `+` and the respective functions). Representative sequence plots and relative frequency sequence plots, however, behave differently because they are composed of multiple plots which are arranged by the [`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"} library. The following section illustrates how the appearance of the plots can be changed ## Changing the appearance of plots ### The default case As mentioned above, most plots rendered by [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) are of class `c("gg", "ggplot")` and can be adjusted just like other plots rendered with [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} #### Example 1: State distribution plot In our first example we illustrate this for state distribution plot. We start with the most basic version of the plot visualizing the state distributions of `actcal.seq` without changing any of the defaults. ```{r, dplot1} # ggseqplot::ggseqdplot ggseqdplot(actcal.seq) ``` We proceed by illustrating how [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} functions & extensions can be used to refine the default outcome. Just like every other [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} figure the appearance of plots generated with [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) functions can be dramatically changed with a few adjustments: ```{r, message=FALSE, warning=FALSE} ggseqdplot(actcal.seq) + scale_fill_discrete_sequential("heat") + scale_x_discrete(labels = month.abb) + labs(title = "State distribution plot", x = "Month") + guides(fill=guide_legend(title="Alphabet")) + theme_ipsum(base_family = "") + # ensures that this works on different OS theme(plot.title = element_text(size = 30, margin=margin(0,0,20,0)), plot.title.position = "plot") ``` In the following example we again illustrate a few [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} functions & extensions by composing a figure comprising two plots produced with `ggseqdplot`. Both visualize the same data but only the first plot considers weights. In addition to state distributions the plots display the accompanying entropies as line plot (`geom_line`). Finally, the plots are brought together using the [`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"} library [@pedersen2020]. ```{r, message=FALSE, warning=FALSE} # Save plot using weights p1 <- ggseqdplot(ex1.seq, with.entropy = TRUE) + ggtitle("Weighted data") # Save same plot without using weights p2 <- ggseqdplot(ex1.seq, with.entropy = TRUE, weighted = FALSE) + ggtitle("Unweighted data") # Arrange and refine plots using patchwork p1 + p2 + plot_layout(guides = "collect") & scale_fill_manual(values= canva_palettes$`Fun and tropical`[1:4]) & theme_ipsum(base_family = "") & theme(plot.title = element_text(size = 20, hjust = 0.5), legend.position = "bottom", legend.title = element_blank()) ``` #### Example 2: Transition rate plot The second set of examples illustrates how to refine a figure of combined transition rate plots. `ggseqtrplot` calls `TraMineR::seqtrate` to obtain the transition rates between the states of the alphabet. `TraMineR::seqtrate` stores these rates in a symmetrical matrix which internally is reshaped into a long format with one row for every combination of states (i.e., the squared size of the sequence alphabet) by `ggseqdplot`. The reshaped data are the input for a [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} call using `geom_tile`. We start with a simple example that only takes the sequence data and the group argument as inputs. The output is a faceted plot visualizing two transition rate matrices of DSS sequence data. ```{r} ggseqtrplot(actcal.seq, group = actcal$sex) ``` In the second example we specify additional arguments and utilize once again the [`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"} library to compose a figure that compares the transition matrices of sequence stored in the STS and the DSS format. We use `x_n.dodge = 2` to prevent overlapping of the state labels of the x-axis, slightly reduce the labels size of the value labels displayed within the tiles, and use `dss = FALSE` to compute and display the transition rates of the STS sequences. ```{r} p1 <- ggseqtrplot(biofam.seq, dss = FALSE, x_n.dodge = 2, labsize = 3) + ggtitle("STS Sequences") + theme(plot.margin = unit(c(5,10,5,5), "points")) p2 <- ggseqtrplot(biofam.seq, x_n.dodge = 2, labsize = 3) + ggtitle("DSS Sequences") + theme(plot.margin = unit(c(5,5,5,10), "points")) p1 + p2 & theme(plot.title = element_text(size = 20, hjust = 0.5)) ``` Other than the grouped version of the plot this composed figure contains the y-axis title and labels twice. This can be changed with small adjustments of the corresponding `theme` arguments. ```{r} p2 <- p2 + theme(axis.text.y = element_blank(), axis.title.y = element_blank()) p1 + p2 & theme(plot.title = element_text(size = 20, hjust = 0.5)) ``` #### Example 3: Flipping coordinates We conclude this section by illustrating that it is also possible to flip the coordinates of the plots rendered by [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/), a procedure that is widely used in the [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} universe (although the coordinates could also be swapped in the `aes(x, y, ...)` specification). In the example below we illustrate the procedure for a mean time plot and a sequence index plot. We always present both, the default plot and the flipped version: ```{r} ## default plot ggseqmtplot(actcal.seq, no.n = TRUE, error.bar = "SE") ## flipped version ggseqmtplot(actcal.seq, no.n = TRUE, error.bar = "SE") + coord_flip() + theme(axis.text.y=element_blank(), axis.ticks.y = element_blank(), panel.grid.major.y = element_blank(), legend.position = "top") ``` While in the example above the flipped plot might be in greater accordance to most people's aesthetic preferences, flipping the coordinates in the case of sequence index plots might be a more of an opinionated design choice. Most scholars prefer to display time on the horizontal axis. However, if you favor time to *run* from the bottom to the top (like in Piccarreta and Lior [-@piccarreta2010]) instead of left to right, your preferences can be easily met. ```{r} ## default plot ggseqiplot(actcal.seq, sortv = "from.end") + scale_x_discrete(labels = month.abb) ## flipped version ggseqiplot(actcal.seq, sortv = "from.end") + scale_x_discrete(labels = month.abb) + coord_flip() ``` ### The special case of combined plots Two types of plots differ from the other [`{ggseqplot}`](https://maraab23.github.io/ggseqplot/) functions because they are composed by two subplots which are arranged to a joint figure with the [`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"} library. The output of those functions cannot be changed in the same was as for the other functions. For details on the [`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"} library we recommend the package's website. #### Example 4: Annotation and themes Some of the adjustments of a combined [`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"} plot are pretty similar to the default [`{ggplot2}`](https://ggplot2.tidyverse.org/){target="_blank"} procedure. In the example below we change the theme and add a title to the plot. Note that the corresponding functions are not added by `+` but with `&` instead. ```{r} ## compute dissimilarity matrix required for plot diss <- seqdist(biofam.seq, method = "LCS") ## Relative Frequency Sequence Plot ## default version ggseqrfplot(biofam.seq, diss = diss, k = 12) ## adjusted version ggseqrfplot(biofam.seq, diss = diss, k = 12) & theme_ipsum(base_family = "") & theme(legend.position = "bottom", legend.title = element_blank(), plot.title = element_text(size = 12)) & plot_annotation(title = "Relative Frequency Sequence Plot") ``` If you want to manipulate the appearance of a specific plot, however, your default code might not work. If you want to change the labels of the index plot, for instance, the following code will *not* produce the desired result, because `scale_x_discrete` will change the appearance of the boxplot, i.e. the last plot used when composing the plot with [`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"}. ```{r} ggseqrfplot(biofam.seq, diss = diss, k = 12) + scale_x_discrete(labels = 15:30) ``` The appearance of the subplots, however, can be changed once you save the composite plot and then adjust its components. ```{r} ## save & view original plot p <- ggseqrfplot(biofam.seq, diss = diss, k = 12) p ## change appearance of sub-plots ## first component: index plot p[[1]] <- p[[1]] + scale_x_discrete(labels = 15:30) ## second component: boxplot p[[2]] <- p[[2]] + labs(title = "Changed title") ## adjusted plot p ``` #### Example 5: The complex case of grouped rplots Note that things become a little bit more complex in the case of a grouped representative sequence plot. In such a plot each group contributes two subplots. One providing information on the "quality" of the representative sequences, and another one containing the corresponding index plots. If we want to change the x-axis labels of the following plot, we have to extract and change the index plots which appear in the second row of the combined plot. The plots are arranged by row. Hence the index plots are the subplots 3 and 4 ```{r} ## Compute a pairwise dissimilarity matrix diss <- seqdist(actcal.seq, method = "LCS") ## original plot p <- ggseqrplot(actcal.seq, diss = diss, nrep = 3, group = actcal$sex) p ## adjusted sequence index subplots p[[3]] <- p[[3]] + scale_x_discrete(labels = month.abb) p[[4]] <- p[[4]] + scale_x_discrete(labels = month.abb) p ``` #### Example 6: What about grouped rfplots? Grouped rfplots are currently not implemented for `ggseqrfplot` and have to be created manually using the [`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"} library. In the following example we 1. create rfplots for our groups (here "Men" and "Women"), then proceed by 2. slightly misusing [`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"}'s tag annotation assigning group-specific tags for the first plot in each row of the final patchwork plot, 3. removing the legend from the upper plot panel (`p$man`), 4. and arranging the adjusted patches in a final plot with two stacked panels. Technically speaking the resulting plot is a nested patchwork plot. According to the documentation of [`{patchwork}`](https://patchwork.data-imaginist.com/index.html){target="_blank"} >[i]t is important to note that plot annotations only have an effect on the top-level patchwork. Any annotation added to nested patchworks are (currently) lost. If you need to have annotations for a nested patchwork you’ll need to wrap it in wrap_elements() with the side-effect that alignment no longer works.^[https://patchwork.data-imaginist.com/articles/guides/annotation.html#titles-subtitles-and-captions] For this reason the group-specific titles were not added with `patchwork::plot_annotation` or `ggplot2::ggtitle` and we reverted to the use of tags instead. ```{r} diss <- seqdist(biofam.seq, method = "LCS") sex <- biofam[501:600, "sex"] p <- map2( levels(sex), # input x c("Men", "Women"), # input y function(x, y) { p <- ggseqrfplot(biofam.seq[sex == x,], diss = diss[sex == x,sex == x], k = 12) p[[1]] <- p[[1]] + labs(tag = y) return(p) } ) names(p) <- levels(sex) (p$man & theme(legend.position = "none")) / p$woman ``` ------------------------------------------------------------------------ ## References