---
title: "Tutorial for TDLM"
author: "Maxime Lenormand"
date: "`r Sys.Date()`"
output: 
  html_vignette:
    number_sections: false
  html_document:
    toc: true
    toc_float:
    collapsed: false
    smooth_scroll: false
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Tutorial for TDLM}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
bibliography: '`r system.file("REFERENCES.bib", package="TDLM")`'
csl: format_citation.csl
---

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE, message = TRUE, warning = FALSE,
                      fig.width = 8, fig.height = 8)
# Packages --------------------------------------------------------------------
suppressPackageStartupMessages({
  suppressWarnings({
    library(TDLM)
    library(sf)
  })
})

options(tinytex.verbose = TRUE)

```

# Introduction

This tutorial aims to describe the different features of the R package `TDLM`.
The main purpose of the `TDLM` package is to provide a rigorous framework for 
fairly comparing trip distribution laws and models [@Lenormand2016]. This 
general framework is based on a two-step approach for generating mobility flows 
by separating the trip distribution law, such as gravity or intervening 
opportunities, from the modeling approach used to generate flows from this law.


# A short note on terminology

This framework is part of the four-step travel model. It corresponds to the
second step, called trip distribution, which aims to match trip origins with
trip destinations. The model used to generate trips or flows, and more
generally the degree of interaction between locations, is often referred to as
a spatial interaction model. Depending on the research area, a matrix or a
network formalism may be used to describe these spatial interactions.
Origin-Destination matrices (or trip tables) are common in geography or
transportation, while in statistical physics or complex systems studies, the
term mobility networks is usually preferred.

# Origin–Destination matrix

The description of movements within a given area is represented by an
Origin-Destination (OD) matrix. The area of interest is divided into $n$
locations, and $T_{ij}$ represents the volume of flows between location $i$ and
location $j$. This volume typically corresponds to the number of trips or a
commuting flow (i.e., the number of individuals living in $i$ and working in
$j$). The OD matrix is square, contains only positive values, and may have a
zero diagonal (Figure 1).

<br>
<center>
  <img align="bottom" width="60%" height="60%" src="../man/figures/OD.png"
  alt="OD matrix">
</center> 
<br>
<center>
  <b>Figure 1: Schematic representation of an Origin-Destination matrix.</b>
</center> 
<br>

# Aggregated inputs information

Three categories of inputs are typically considered to simulate an OD matrix
(Figure 2). The masses and distances are the primary ingredients used to
generate a matrix of probabilities based on a given distribution law. Thus, the
probability $p_{ij}$ of observing a trip from location $i$ to location $j$
depends on the masses, the demand at the origin ($m_i$), and the offer at
the destination ($m_j$). Typically, population is used as a surrogate for
demand and offer. The probability of movement also depends on the costs,
which are based on the distance $d_{ij}$ between locations or the number of
opportunities $s_{ij}$ between locations, depending on the chosen law (more
details are provided in the next "Trip distribution laws" section). In general,
the effect of the cost can be adjusted with a parameter.

The margins are used to generate an OD matrix from the matrix of probabilities
while preserving the total number of trips ($N$), the number of outgoing trips
($O_i$), and/or the number of incoming trips ($D_j$) (more details are provided
in the "Constrained distribution models" section).

<br>
<center>
  <img align="bottom" width="75%" height="75%" src="../man/figures/Inputs.png"
  alt="Inputs information">
</center> 
<br>
<center>
  <b>Figure 2: Schematic representation of the aggregated inputs information.
  </b>
</center> 
<br>

# Trip distribution laws

The purpose of a trip distribution law is to estimate the probability $p_{ij}$
that, out of all the possible travels in the system, there is one between
location $i$ and location $j$. This probability is asymmetric in $i$ and $j$,
as are the flows themselves. It takes the form of a square matrix of probabilities.
This probability is normalized across all possible pairs of origins and
destinations, such that $\sum_{i,j=1}^n p_{ij} = 1$. Therefore, a matrix of
probabilities can be obtained by normalizing any OD matrix (Figure 3).

<br>
<center>
  <img align="bottom" width="60%" height="60%" src="../man/figures/Proba.png"
  alt="Probability matrix">
</center> 
<br>
<center>
  <b>Figure 3: Schematic representation of the matrix of probabilities.
  </b>
</center> 
<br>

As mentioned in the previous section, most trip distribution laws depend
on the demand at the origin ($m_i$), the offer at the destination ($m_j$), and
a cost to move from $i$ to $j$. There are two major approaches for estimating
the probability matrix. The traditional gravity approach, in analogy with
Newton's law of gravitation, is based on the assumption that the amount of trips
between two locations is related to their populations and decays according to a
function of the distance $d_{ij}$ between locations. In contrast to the gravity
law, the laws of intervening opportunities hinge on the assumption that the
number of opportunities $s_{ij}$ between locations plays a more important role
than the distance [@Lenormand2016]. This fundamental difference between the two
schools of thought is illustrated in Figure 4.

<br>
<center>
  <img align="bottom" width="50%" height="50%" 
  src="../man/figures/GravVsOpp.png" alt="Gravity vs IO">
</center> 
<br>
<center>
  <b>Figure 4: Illustration of the fundamental difference between gravity and 
  intervening opportunity laws.
  </b>
</center> 
<br>

It is important to note that the effect of the cost between locations (distance
or number of opportunities) can usually be adjusted with a parameter, which can
be calibrated automatically or by comparing the simulated matrix with observed
data.

# Constrained distribution models

The purpose of the trip distribution models is to generate an OD matrix 
$\tilde{T}=(\tilde{T}_{ij})$ by 
drawing at random $N$ trips from the trip distribution law 
$(p_{ij})_{1 \leq i,j \leq n}$ respecting different level of constraints 
according to the model. We considered four different types of models in this
package. As can be observed in Figure 5, the four models respect different level
of constraints from the total number of trips to the total number of out-going 
and in-coming trips by locations (i.e. the margins).

<br>
<center>
  <img align="bottom" width="80%" height="80%" src="../man/figures/Models.png"
  alt="Constrained models">
</center> 
<br>
<center>
  <b>Figure 5: Schematic representation of the constrained distribution models.
  </b>
</center> 
<br>

More specifically, the volume of flows $\tilde{T}_{ij}$ is generated from the
matrix of probability with multinomial random draws that will take different 
forms according to the model used [@Lenormand2016]. Therefore, since the process
is stochastic, each simulated matrix is unique and composed of integers. Note 
that it is also possible to generate an average matrix from the multinomial 
trials.

# Goodness-of-fit measures

Finally, the trip distribution laws can be calibrated, and both the trip 
distribution laws and models can be evaluated by comparing a simulated matrix 
$\tilde{T}$ with the observed matrix $T$. These comparisons rely on various 
goodness-of-fit measures, which may or may not account for the distance 
between locations. These measures are described above.

## Common Part of Commuters (CPC)

$$\displaystyle CPC(T,\tilde{T}) = \frac{2\cdot\sum_{i,j=1}^n min(T_{ij},\tilde{T}_{ij})}{N + \tilde{N}}$$

[@Gargiulo2012;@Lenormand2012;@Lenormand2016]

## Normalized Root Mean Square Error (NRMSE)

$$\displaystyle NRMSE(T,\tilde{T}) = \sqrt{\frac{\sum_{i,j=1}^n (T_{ij}-\tilde{T}_{ij})^2}{N}}$$

## Kullback–Leibler divergence (KS)

$$\displaystyle KL(T,\tilde{T}) = \sum_{i,j=1}^n \frac{T_{ij}}{N}\log\left(\frac{T_{ij}}{N}\frac{\tilde{N}}{\tilde{T}_{ij}}\right)$$
[@Kullback1951]

## Common Part of Links (CPL) 

$$\displaystyle CPL(T,\tilde{T}) = \frac{2\cdot\sum_{i,j=1}^n 1_{T_{ij}>0} \cdot 1_{\tilde{T}_{ij}>0}}{\sum_{i,j=1}^n 1_{T_{ij}>0} + \sum_{i,j=1}^n 1_{\tilde{T}_{ij}>0}}$$
[@Lenormand2016]

## Common Part of Commuters based on the disance (CPC_d)

Let us consider $N_k$ (and 
$\tilde{N}_k$) the sum of observed (and simulated) flows at a distance comprised
in the bin $[\mbox{bin_size} \cdot k-\mbox{bin_size}, \mbox{bin_size} \cdot k[$.

$$\displaystyle CPC_d(T,\tilde{T}) = \frac{2\cdot\sum_{k=1}^{\infty} min(N_{k},\tilde{N}_{k})}{N+\tilde{N}}$$

[@Lenormand2016]

## Kolmogorv-Smirnov statistic and p-value (KS). 

These measures, described in @Massey1951, are based on the observed and 
simulated flow distance distributions and are computed using the [ks_test](https://hectorrdb.github.io/Ecume/reference/ks_test.html)
function from the [Ecume](https://cran.r-project.org/package=Ecume) package.

---

# Example of commuting in Kansas

## Data

In this example, we will use commuting data from US Kansas in 2000 to illustrate  
the main functions of the package. The dataset comprises three tables providing 
information on commuting flows between the 105 US 
Kansas counties in 2000. The observed OD matrix 
[od](https://epivec.github.io/TDLM/reference/od.html) is a zero-diagonal square 
matrix of integers. Each element of the matrix represents the number of 
commuters between a pair of US Kansas counties.

```{r}
data(od)

od[1:5, 1:5]

dim(od)
```

The aggregated data are composed of the 
[distance](https://epivec.github.io/TDLM/reference/distance.html) matrix,

```{r}
data(distance)

distance[1:5, 1:5]

dim(distance)
```

and the masses and margins contained in the data.frame 
[mass](https://epivec.github.io/TDLM/reference/mass.html).

```{r}
data(mass)

mass[1:10,]

dim(mass)

mi <- as.numeric(mass[,1])
names(mi) <- rownames(mass)

mj <- mi

Oi <- as.numeric(mass[,2])
names(Oi) <- rownames(mass)

Dj <- as.numeric(mass[,3])
names(Dj) <- rownames(mass)
```

The data must always be based on the same number of locations sorted in the same  
order. The function 
[check_format_names](https://epivec.github.io/TDLM/reference/check_format_names.html) 
can be used to verify the validity of all inputs before running the main 
functions of the package.

```{r}
check_format_names(vectors = list(mi = mi, mj = mj, Oi = Oi, Dj = Dj),
                   matrices = list(od = od, distance = distance),
                   check = "format_and_names")
```

Optional spatial information are also provided here. 
[county](https://epivec.github.io/TDLM/reference/county.html) is a spatial 
object containing the geometry of the 105 US Kansas counties in 2000. 

```{r}
library(sf)

data(county)

county[1:10,]

plot(county)
```

[coords](https://epivec.github.io/TDLM/reference/coords.html) and 
[coords_xy](https://epivec.github.io/TDLM/reference/coords_xy.html) are two 
dataframes providing longitude/latitude and X/Y coordinates for each location,
respectively.

```{r}
coords[1:10,]

coords_xy[1:10,]
```

## Extract Additional Spatial Information

The functions 
[extract_distances](https://epivec.github.io/TDLM/reference/extract_distances.html),
[extract_opportunities](https://epivec.github.io/TDLM/reference/check_format_names.html), 
and 
[extract_spatial_information](https://epivec.github.io/TDLM/reference/extract_spatial_information.html)  
can be used to extract matrices of distances and the number of intervening 
opportunities.

The first function computes distances in kilometers between pairs of locations 
based on geographical coordinates. It can calculate either great-circle 
distances, using longitude/latitude coordinates and the Haversine formula

```{r}
haversine_d <- extract_distances(coords = coords,
                                 method = "Haversine")
haversine_d[1:5, 1:5]

distance[1:5, 1:5]
```

or Euclidean distances based on X/Y coordinates

```{r}
xy_d <- extract_distances(coords = coords_xy,
                          method = "Euclidean")

oldmar <- par()$mar
par(mar = c(4.5, 6, 1, 1))
plot(haversine_d, xy_d, xlim=c(0,900), ylim=c(0,900),
     type="p", pch=16, cex=2, lty=1, lwd=3, 
     col="steelblue3", axes=FALSE, xlab="", ylab="")
axis(1, cex.axis=1.2)
axis(2, cex.axis=1.2, las=1)
mtext("Haversine (km)", 1, line = 3.25, cex = 1.75)
mtext("Euclidean (km)", 2, line = 4, cex = 1.75)
box(lwd=1.5)
par(mar = oldmar)
```

The second function computes the number of opportunities between pairs of 
locations. For a given pair of locations, the number of opportunities between 
the origin and the destination is based on the number of opportunities within 
a circle of radius equal to the distance between the origin and the destination,
centered at the origin. The number of opportunities at the origin and the 
destination are not included. In our case, the number of inhabitants ($m_i$) is 
used as a proxy for the number of opportunities.

```{r}
sij <- extract_opportunities(opportunity = mi,
                             distance = distance,
                             check_names = TRUE)
sij[1:5, 1:5]
```

The last function takes as input a spatial object containing the geometry of 
the locations that can be handled by the 
[sf](https://cran.r-project.org/package=sf) package. It returns a matrix of 
great-circle distances between locations (expressed in km). An optional `id`  
can also be provided to be used as names for the outputs.


```{r}
spi <- extract_spatial_information(county, id = "ID")

sp_d <- spi$distance

sp_d[1:5, 1:5]

distance[1:5, 1:5]
```

This function also allows extracting the surface area of each location 
(in square kilometers), which can be useful to calibrate the trip distribution  
laws' parameter value (see below).

```{r}
mean(spi$surface)
```

## Run functions

The main function of the package is 
[run_law_model](https://epivec.github.io/TDLM/reference/run_law_model.html). The
function has two sets of arguments, one for the law and another one for the 
model. The inputs (described above) necessary to run this function depends on 
the law (either the matrix of distances or number of opportunities can be used,
or neither of them for the uniform law) and on the model and its associated 
constraints (number of trips, out-going trips and/or in-coming trips). The 
example below will generate three simulated ODs with the normalized gravity law 
with an exponential distance decay function [@Lenormand2016] and the Doubly 
Constrained Model.

```{r}
res <- run_law_model(law = "NGravExp", 
                     mass_origin = mi, 
                     mass_destination = mj, 
                     distance = distance, 
                     opportunity = NULL,
                     param = 0.01,
                     write_proba = TRUE,
                     
                     model = "DCM", 
                     nb_trips = NULL, 
                     out_trips = Oi, 
                     in_trips = Dj,
                     average = FALSE, 
                     nbrep = 3)
```

The output is an object of class `TDLM`. In this case it is a list of matrices
composed of the three simulated matrices (`replication_1`, `replication_2` and 
`replication_3`), the  matrix of probabilities (called `proba`) associated with 
the law and returned only if `write_proba = TRUE`. The objects of class `TDLM`
contain a table `info` summarizing the simulation run. 

```{r}
print(res)

str(res)
```

This simulation run was based on one parameter value. It is possible to use a 
vector instead of a scalar for the `param` argument.

```{r}
res <- run_law_model(law = "NGravExp", 
                     mass_origin = mi, 
                     mass_destination = mj, 
                     distance = distance, 
                     opportunity = NULL,
                     param = c(0.01,0.02),
                     write_proba = TRUE,
                     
                     model = "DCM", 
                     nb_trips = NULL, 
                     out_trips = Oi, 
                     in_trips = Dj,
                     average = FALSE, 
                     nbrep = 3)
```

In this case a list of list of matrices will be returned (one for each parameter
value).

```{r}
print(res)

str(res)
```

It is also important to note that the radiation law and the uniform law are free
of parameter. 

```{r}
res <- run_law_model(law = "Rad", 
                     mass_origin = mi, 
                     mass_destination = mj, 
                     distance = NULL, 
                     opportunity = sij,
                     param = NULL,
                     write_proba = TRUE,
                     
                     model = "DCM", 
                     nb_trips = NULL, 
                     out_trips = Oi, 
                     in_trips = Dj,
                     average = FALSE, 
                     nbrep = 3)

print(res)
```

The argument `average` can be used to generate an average matrix based on a 
multinomial distribution (based on an infinite number of drawings). In this 
case, the models' inputs can be either positive integer or real numbers and the 
output (`nbrep = 1` in this case) will be a matrix of positive real numbers.

```{r}
res$replication_1[1:10,1:10]

res <- run_law_model(law = "Rad", 
                     mass_origin = mi, 
                     mass_destination = mj, 
                     distance = NULL, 
                     opportunity = sij,
                     param = NULL,
                     write_proba = TRUE,
                     
                     model = "DCM", 
                     nb_trips = NULL, 
                     out_trips = Oi, 
                     in_trips = Dj,
                     average = TRUE, 
                     nbrep = 3)

print(res)

res$replication_1[1:10,1:10]
```

The functions [run_law](https://epivec.github.io/TDLM/reference/run_law.html) 
and [run_model](https://epivec.github.io/TDLM/reference/run_model.html) have 
been designed to run only one of the two components of the two-step approach.
They function the same as a 
[run_law_model](https://epivec.github.io/TDLM/reference/run_law_model.html), but
it is worth noting that only inter-location flows are considered for the 
distribution laws, meaning that the matrix of probabilities (and associated 
simulated OD matrices) generated by a given distribution law with 
[run_law_model](https://epivec.github.io/TDLM/reference/run_law_model.html) or
[run_law](https://epivec.github.io/TDLM/reference/run_law.html) is a 
zero-diagonal matrix. Nevertheless, it is possible to generate intra-location 
flows with [run_model](https://epivec.github.io/TDLM/reference/run_model.html) 
taking any kind of matrix of probabilities as input.

## Parameters' calibration & models' evaluation

The package contains two function to help calibrating and evaluating the model.
The function [gof](https://epivec.github.io/TDLM/reference/gof.html) computes
goodness-of-fit measures between observed and simulated OD matrices and the 
function [calib_param](https://epivec.github.io/TDLM/reference/calib_param.html) 
that estimates the optimal parameter value for a given law and a given spatial 
distribution of location based on the Figure 8 in [@Lenormand2016]. 

Let us illustrate the trip distribution laws and models' calibration with the 
the normalized gravity law with an exponential distance decay function and the
Doubly Constrained Model. Based on the average surface area of the Kansas 
counties (in square kilometers) it seems that the optimal parameter value of the 
normalized gravity law with an exponential distance decay function (as described
in [@Lenormand2016]) for commuting in US Kansas counties is around 0.085.

```{r}
print(calib_param(av_surf = mean(spi$surface), law = "NGravExp"))
```

This is just an estimation that help us to identify the potential range of 
parameter value, so in order to rigorously calibrate and evaluate the
trip distribution law and model we need to compute the goodness-of-fit 
measure for a wide range of parameter values.

```{r}
res <- run_law_model(law = "NGravExp", 
                     mass_origin = mi, 
                     mass_destination = mj, 
                     distance = distance, 
                     opportunity = NULL,
                     param = seq(0.05,0.1,0.005),
                     write_proba = TRUE,
                     
                     model = "DCM", 
                     nb_trips = NULL, 
                     out_trips = Oi, 
                     in_trips = Dj,
                     average = FALSE, 
                     nbrep = 3)

calib <- gof(sim = res, obs = od, measures = "all", distance = distance)

print(calib)
```

All the necessary information is stored in the object calib, most of the 
goodness-of-fit measures agree on a parameter value of 0.075 in that case with
an associated average Common Part of Commuter equal to 85.6%.

```{r}
cpc <- aggregate(calib$CPC, list(calib$Parameter_value), mean)[,2]

oldmar <- par()$mar
par(mar = c(4.5, 6, 1, 1))
plot(seq(0.05,0.1,0.005), cpc, type="b", pch=16, cex=2, lty=1, lwd=3, 
     col="steelblue3", axes=FALSE, xlab="", ylab="")
axis(1, cex.axis=1.2)
axis(2, cex.axis=1.2, las=1)
mtext("Parameter value", 1, line = 3.25, cex = 1.75)
mtext("Common Part of Commuters", 2, line = 4, cex = 1.75)
box(lwd=1.5)
par(mar = oldmar)
```

---

# Reference