---
title: "Working with CFtime"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with CFtime}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


```{r setup, include = FALSE}
library(CFtime)
library(ncdf4)
```

## Climate change models and calendars

Around the world, many climate change models are being developed (100+) under
the umbrella of the [World Climate Research Programme](https://www.wcrp-climate.org) to assess the
rate of climate change. Published data is generally publicly available to download for
research and other (non-commercial) purposes through partner organizations in 
the Earth Systems Grid Federation.

The data are all formatted to comply with the [CF Metadata Conventions](http://cfconventions.org),
a set of standards to support standardization among research groups and published
data sets. These conventions greatly facilitate use and analysis of the climate
projections because standard processing work flows (should) work across
the various data sets.

On the flip side, the CF Metadata Conventions needs to cater to a wide range of
modeling requirements and that means that some of the areas covered by the standards are
more complex than might be assumed. One of those areas is the temporal dimension
of the data sets. The CF Metadata Conventions supports no less than nine different
calendar definitions, that, upon analysis, fall into five distinct calendars
(from the perspective of computation of climate projections):

 - `standard` or `gregorian`: The international civil calendar that is in common use in 
 many countries around the world, adopted by edict of Pope Gregory XIII in 1582
 and in effect from 15 October of that year. The `proleptic_gregorian` calendar
 is the same as the `gregorian` calendar, but with validity
 extended to periods prior to `1582-10-15`.
 - `julian`: Adopted in the year 45 BCE, every fourth year is a leap year. 
 Originally, the julian calendar did not have a monotonically increasing year 
 assigned to it and there are indeed several julian calendars in use around the
 world today with different years assigned to them. Common interpretation
 is currently that the year is the same as that of the standard calendar. The 
 julian calendar is currently 13 days behind the gregorian calendar.
 - `365_day` or `noleap`: No years have a leap day.
 - `366_day` or `all_leap`: All years have a leap day.
 - `360_day`: Every year has 12 months of 30 days each.

The three latter calendars are specific to the CF Metadata Conventions to
reduce computational complexities of working with dates. These three, and the
julian calendar, are not compliant with the standard `POSIXt` date/time
facilities in `R` and using standard date/time procedures would quickly lead to
problems. In the below code snippet, the date of `1949-12-01` is the *datum*
from which other dates are calculated. When adding 43,289 days to this *datum*
for a data set that uses the `360_day` calendar, that should yield a date some 120 
years after the *datum*:

```{r}
# POSIXt calculations on a standard calendar - INCORRECT
as.Date("1949-12-01") + 43289

# CFtime calculation on a "360_day" calendar - CORRECT
# See below examples for details on the two functions
as_timestamp(CFtime("days since 1949-12-01", "360_day", 43289))
```

Using standard `POSIXt` calculations gives a result that is about 21 months off
from the correct date - obviously an undesirable situation. This example is far
from artificial: `1949-12-01` is the datum for all CORDEX data, covering the 
period 1951 - 2005 for historical experiments and the period 2006 - 2100 for RCP 
experiments (with some deviation between data sets), and several models used in 
the CORDEX set use the `360_day` calendar. The `365_day` or `noleap` calendar 
deviates by about 1 day every 4 years (disregarding centurial years), or about 
24 days in a century. The `366_day` or  `all_leap` calendar deviates by about 3 
days every 4 years, or about 76 days in a century.

The `CFtime` package deals with the complexity of the different calendars allowed
by the CF Metadata Conventions. It properly formats dates and times (even oddball
dates like `2070-02-30`) and it can generate calendar-aware factors for further 
processing of the data.

##### Time zones

The character of CF time series - a number of numerical offsets from a base date -
implies that there should only be a single time zone associated with the time
series. The time zone offset from UTC is stored in the datum and can be retrieved
with the `timezone()` function. If a vector of character timestamps with time
zone information is parsed with the `CFparse()` function and the time zones are 
found to be different from the datum time zone, a warning message is generated
but the timestamp is interpreted as being in the datum time zone. No correction 
of timestamp to datum time zone is performed.

## Using CFtime to deal with calendars

Data sets that are compliant with the CF Metadata Conventions always include a 
*datum*, a specific point in time in reference to a specified *calendar*, from 
which other points in time are calculated by adding a specified *offset* of a
certain *unit*. This approach is encapsulated in the `CFtime` package by the S4 
class `CFtime`.

```{r}
# Create a CF time object from a definition string, a calendar and some offsets
cf <- CFtime("days since 1949-12-01", "360_day", 19830:90029)
cf
```

The `CFtime()` function takes a *datum* description (which is actually a unit -
"days" - in reference to a datum - "1949-12-01"), a calendar description, and a 
vector of *offsets* from that datum. Once a `CFtime` instance is created its 
datum and calendar cannot be changed anymore. Offsets may be added.

In practice, these parameters will be taken from the data set of interest. CF Metadata
Conventions require data sets to be in the NetCDF format, with all metadata
describing the data set included in a single file, including the mandatory 
"Conventions" global attribute which should have a string identifying the version 
of the CF Metadata Conventions that this file adheres to (among possible others).
Not surprisingly, all the pieces of interest are contained in the mandatory `time` 
dimension of the file. The process then becomes as follows, for a CMIP6 file of 
daily precipitation:

```{r}
# Opening a data file that is included with the package and showing some attributes.
# Usually you would `list.files()` on a directory of your choice.
fn <- list.files(path = system.file("extdata", package = "CFtime"), full.names = TRUE)[1]
nc <- nc_open(fn)
attrs <- ncatt_get(nc, "")
attrs$title

# "Conventions" global attribute must have a string like "CF-1.*" for this package to work reliably
attrs$Conventions

# Create the CFtime instance from the metadata in the file.
cf <- CFtime(nc$dim$time$units, 
             nc$dim$time$calendar, 
             nc$dim$time$vals)
cf
```

You can see from the global attribute "Conventions" that the file adheres to the
CF Metadata Conventions, among others. According to the CF conventions, `units` 
and `calendar` are required attributes of the `time` dimension in the NetCDF file, 
and `nc$dim$time$vals` are the offset values, or `dimnames()` in `R` terms, for 
the `time` dimension of the data.

The above example (and others in this vignette) use the `ncdf4` package. If you 
are using the `RNetCDF` package, checking for CF conventions and then creating a 
`CFtime` instance goes like this:

```{r}
library(RNetCDF)
nc <- open.nc(fn)
att.get.nc(nc, -1, "Conventions")
cf <- CFtime(att.get.nc(nc, "time", "units"), 
             att.get.nc(nc, "time", "calendar"), 
             var.get.nc(nc, "time"))
cf
```

The corresponding character representations of the time series can be easily generated:

```{r}
dates <- as_timestamp(cf, format = "date")
dates[1:10]
```

...as well as the full range of the time series:

```{r}
range(cf)
```

Note that in this latter case, if any of the timestamps in the time series have a time that is
other than `00:00:00` then the time of the extremes of the time series is also
displayed. This is a common occurrence because the CF Metadata Conventions
prescribe that the middle of the time period (month, day, etc) is recorded, which 
for months with 31 days would be something like `2005-01-15T12:00:00`.

## Supporting processing of climate projection data

When working with high resolution climate projection data, typically at a "day"
resolution, one of the processing steps would be to aggregate the data to some
lower resolution such as a dekad (10-day period), a month or a meteorological 
season, and then compute a derivative value such as the dekadal sum of precipitation,
monthly minimum/maximum daily temperature, or seasonal average daily short-wave
irradiance.

It is also possible to create factors for multiple "epochs" in one go. This greatly
reduces programming effort if you want to calculate anomalies over multiple future
periods. A complete example is provided in the vignette 
["Processing climate projection data"](Processing.html).

It is easy to generate the factors that you need once you have a `CFtime` instance 
prepared:

```{r}
# Create a dekad factor for the whole `cf` time series that was created above
f_k <- CFfactor(cf, "dekad")
str(f_k)

# Create monthly factors for a baseline epoch and early, mid and late 21st century epochs
baseline <- CFfactor(cf, epoch = 1991:2020)
future <- CFfactor(cf, epoch = list(early = 2021:2040, mid = 2041:2060, late = 2061:2080))
str(future)
```

For the "epoch" version, there are two interesting things to note here:

 - The epochs do not have to coincide with the boundaries of the time series. In the 
 example above, the time series starts in 2015, while the baseline epoch is from 1991.
 Obviously, the number of time steps from the time series that then fall within 
 this epoch will then be reduced.
 - The factor is always of the same length as the time series, with `NA` values
 where the time series values are not falling in the epoch. This ensures that the
 factor is compatible with the data set which the time series describes, such
 that functions like `tapply()` will not throw an error.
 
There are six periods defined for `CFfactor()`:

 - `year`, to summarize data to yearly timescales
 - `season`, the meteorological seasons. Note that the month of December will be
 added to the months of January and February of the following year, so the date 
 "2020-12-01" yields the factor value "2021S1".
 - `quarter`, the standard quarters of the year.
 - `month`, monthly summaries, the default period.
 - `dekad`, 10-day period. Each month is subdivided in dekads as follows: (1) 
 days 01 - 10; (2) days 11 - 20; (3) remainder of the month.
 - `day`, to summarize sub-daily data.

##### New "time" dimension

A CFtime instance describes the "time" dimension of an associated data set. When
you process that dimension of the data set using `CFfactor()` or another method
to filter or otherwise subset the "time" dimension, the resulting data set will
have a different "time" dimension. To associate a proper CFtime instance with
your processing result, the methods in this package return that CFtime instance
as an attribute:

```{r}
  new_time <- attr(f_k, "CFtime")
  new_time
```

In the vignette ["Processing climate projection data"](Processing.html) is a
fully worked out example of this.

##### Incomplete time series

You can test if your time series is complete with the function `CFcomplete()`.
A time series is considered complete if the time steps between the two extreme 
values are equally spaced. There is a "fuzzy" assessment of completeness for
time series with a datum unit of "days" or smaller where the time steps are
months or years apart - these have different lengths in days in different months 
or years (e.g. a leap year).

If your time series is incomplete, for instance because it has missing time steps, 
you should recognize that in your further processing. As an example, you might 
want to filter out months that have fewer than 90% of daily data from further 
processing or apply weights based on the actual coverage. 

```{r}
# Is the time series complete?
is_complete(cf)

# How many time units fit in a factor level?
CFfactor_units(cf, baseline)

# What's the absolute and relative coverage of our time series
CFfactor_coverage(cf, baseline, "absolute")
CFfactor_coverage(cf, baseline, "relative")
```

The time series is complete but coverage of the baseline epoch is only 
20%! Recall that the time series starts in 2015 while the baseline period in the
factor is for `1991:2020` so that's only 6 years of time series data out of 30 
years of the baseline factor.

An artificial example of missing data:

```{r}
# 4 years of data on a `365_day` calendar, keep 80% of values
n <- 365 * 4
cov <- 0.8
offsets <- sample(0:(n-1), n * cov)

cf <- CFtime("days since 2020-01-01", "365_day", offsets)
cf
# Note that there are about 1.25 days between observations

mon <- CFfactor(cf, "month")
CFfactor_coverage(cf, mon, "absolute")
CFfactor_coverage(cf, mon, "relative")
```

Keep in mind, though, that there are data sets where the time unit is lower than
the intended resolution of the data. Since the CF conventions recommend that the
coarsest time unit is "day", many files with monthly data sets have a definition
like `days since 2016-01-01` with offset values for the middle of the month like
`15, 44, 74, 104, ...`. Even in these scenarios you can verify that your data 
set is complete with the function `CFcomplete()`.

## CFtime and POSIXt

The CF Metadata Conventions support nine different calendars. Of these, only the
`standard`, `gregorian` and `proleptic_gregorian` calendars are fully compatible
with POSIXt. The other calendars have varying degrees of discrepancies:

 - `julian`: Every fouth year is a leap year. Dates like `2100-02-29` and
 `2200-02-29` are valid.
 - `365_day` or `noleap`: No leap year exists. `2020-02-29` does not occur.
 - `366_day` or `all_leap`: All years are leap years.
 - `360_day`: All months have 30 days in every year. This means that 31 January,
 March, May, July, August, October and December never occur, while 29 and 30
 February occur in every year.
 
Converting time series using these incompatible calendars to `Date`s is likely
to produce problems. This is most pronounced for the `360_day` calendar:

```{r}
# Days in January and February
cf <- CFtime("days since 2023-01-01", "360_day", 0:59)
cf_days <- as_timestamp(cf, "date")
as.Date(cf_days)
```

31 January is missing from the vector of `Date`s because the `360_day` calendar 
does not include it and 29 and 30 February are `NA`s because POSIXt rejects them.
This will produce problems later on when processing your data.

The general advice is therefore: **do not convert CFtime objects to Date objects**
unless you are sure that the `CFtime` object uses a POSIXt-compatible calendar.

##### So how do I compare climate projection data with different calendars?

One reason to convert the "time" dimension from different climate projection data
sets is to be able to compare the data from different models and produce a
multi-model ensemble. The correct procedure to do this is to first calculate 
**for each data set individually** the property of interest (e.g. average daily 
rainfall per month anomaly for some future period relative to a baseline period), 
which will typically involve aggregation to a lower resolution (such as from 
daily data to monthly averages), and only then combine the aggregate data from 
multiple data sets to compute statistically interesting properties (such as 
average among models and standard deviation, etc).

Once data is aggregated from daily or higher-resolution values, the different 
calendars no longer matter (although if you do need to convert averaged data to
absolute data you should use `CFfactor_units()` to make sure that you use the
correct scaling factor).

Otherwise, there really shouldn't be any reason to convert the time series in the
data files to `Date`s. Climate projection data is virtually never compared on a 
day-to-day basis between different models and neither does complex date arithmetic 
make much sense (such as adding intervals) - `CFtime` can support basic arithmetic
by manipulation the offsets of the `CFtime` object. The character representations
that are produced are perfectly fine to use for `dimnames()` on an array or as
`rownames()` in a `data.frame` and these also support basic logical operations
such as `"2023-02-30" < "2023-03-01"`. So ask yourself, do you really need `Date`s
when working with unprocessed climate projection data? (If so, [open an issue on 
GitHub](https://github.com/pvanlaake/CFtime/issues)).

A complete example of creating a multi-model ensemble is provided in the vignette 
["Processing climate projection data"](Processing.html).

## Final observations

 - This package is intended to facilitate processing of climate projection
 data. It is not a full implementation of the CF Metadata Conventions "time"
 component.
 - In parsing and deparsing of offsets and timestamps, data is rounded to
 3 digits of precision of the unit of the datum. When using a description of time
 that is very different from the datum unit, this may lead to some loss of 
 precision due to rounding errors. For instance, if milli-second precision is 
 required, use a unit of "seconds". The authors have no knowledge of published
 climate projection data that requires milli-second precision so for the
 intended use of the package this issue is marginal.