--- title: "Filtering GTFS feeds" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Filtering GTFS feeds} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(gtfstools) ``` GTFS feeds are often used to describe very large and complex public transport networks. These large files can become quite cumbersome to use, manipulate and move around, and it's not unfrequent that one wants to analyze a specific subset of the data. **gtfstools** includes a few functions to filter GTFS feeds, thus allowing for faster and more convenient data processing. The filtering functions currently available are: - `filter_by_agency_id()` - `filter_by_route_id()` - `filter_by_service_id()` - `filter_by_shape_id()` - `filter_by_stop_id()` - `filter_by_trip_id()` - `filter_by_route_type()` - `filter_by_weekday()` - `filter_by_time_of_day()` - `filter_by_sf()` This vignette will introduce you to these functions and will cover their usage in detail. We will start by loading the packages required by this demonstration into the current **R** session: ```{r, message = FALSE, eval = requireNamespace("ggplot2", quietly = TRUE)} library(gtfstools) library(ggplot2) ``` # Filtering by `agency_id`, `route_id`, `service_id`, `shape_id`, `stop_id`, `trip_id` or `route_type`: The first six work in a very similar fashion. You specify a vector of identifiers, and the function keeps (or drops, as we'll see soon) all the entries that are in any way related to this id. Let's see how that works using `filter_by_trip_id()`: ```{r} path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools") gtfs <- read_gtfs(path) utils::object.size(gtfs) head(gtfs$trips[, .(trip_id, trip_headsign, shape_id)]) # keeping trips CPTM L07-0 and CPTM L07-1 smaller_gtfs <- filter_by_trip_id(gtfs, c("CPTM L07-0", "CPTM L07-1")) utils::object.size(smaller_gtfs) head(smaller_gtfs$trips[, .(trip_id, trip_headsign, shape_id)]) unique(smaller_gtfs$shapes$shape_id) ``` We can see from the code snippet above that the function not only filters the `trips` table, but all other tables that contain a key that can be identified via its relation to `trip_id`. For example, since the trips `CPTM L07-0` and `CPTM L07-1` are described by the shapes `17846` and `17847`, respectively, these are the only shapes kept in `smaller_gtfs`. The function also supports the opposite behaviour: instead of keeping the entries related to the specified identifiers, you can *drop* them. To do that, set the `keep` argument to `FALSE`: ```{r} # dropping trips CPTM L07-0 and CPTM L07-1 smaller_gtfs <- filter_by_trip_id( gtfs, c("CPTM L07-0", "CPTM L07-1"), keep = FALSE ) utils::object.size(smaller_gtfs) head(smaller_gtfs$trips[, .(trip_id, trip_headsign, shape_id)]) head(unique(smaller_gtfs$shapes$shape_id)) ``` And the specified trips (and their respective shapes as well) are nowhere to be seen. Please note that, since we are keeping many more entries in the second case, the resulting GTFS object, though smaller than the original, is much larger than in the first case. The same logic demonstrated with `filter_by_trip_id()` applies to the functions that filter feeds by `agency_id`, `route_id`, `service_id`, `shape_id`, `stop_id` and `route_type`. # Filtering by day of the week or time of the day: Frequently enough one wants to analyze service levels on certain days of the week or during different times of the day. The functions `filter_by_weekday()` and `filter_by_time_of_day()` can be used to this purpose. The first one takes the days of the week you want to keep/drop and also includes a `combine` argument that controls how multi-day filters work. Let's see how that works with a few examples: ```{r} # keeping entries related to services than run on saturdays AND sundays smaller_gtfs <- filter_by_weekday( gtfs, weekday = c("saturday", "sunday"), combine = "and" ) smaller_gtfs$calendar[, c("service_id", "sunday", "saturday")] # keeping entries related to services than run EITHER on saturdays OR on sundays smaller_gtfs <- filter_by_weekday( gtfs, weekday = c("sunday", "saturday"), combine = "or" ) smaller_gtfs$calendar[, c("service_id", "sunday", "saturday")] # dropping entries related to services that run on saturdaus AND sundays smaller_gtfs <- filter_by_weekday( gtfs, weekday = c("saturday", "sunday"), combine = "and", keep = FALSE ) smaller_gtfs$calendar[, c("service_id", "sunday", "saturday")] # dropping entries related to services than run EITHER on saturdays OR on # sundays smaller_gtfs <- filter_by_weekday( gtfs, weekday = c("sunday", "saturday"), combine = "or", keep = FALSE ) smaller_gtfs$calendar[, c("service_id", "sunday", "saturday")] ``` Meanwhile, `filter_by_time_of_day()` takes the beginning and the end of a time block (the `from` and `to` arguments, respectively) and keeps the entries related to trips that run within the specified block. Please note that the function works a bit differently depending on whether a trip's behaviour is described using the `frequencies` and the `stop_times` tables together or using the `stop_times` table alone: the `stop_times` entries of trips described in `frequencies` should not be filtered, because they are just "templates" that describe how long it takes from one stop to another (i.e. the departure and arrival times listed there should not be considered "as is"). Let's see what that means with an example: ```{r} smaller_gtfs <- filter_by_time_of_day(gtfs, from = "05:00:00", to = "06:00:00") head(smaller_gtfs$frequencies) # stop_times entries are preserved because they should be interpreted as # "templates" head(smaller_gtfs$stop_times[, c("trip_id", "departure_time", "arrival_time")]) # had the feed not had a frequencies table, the stop_times table would be # adjusted frequencies <- gtfs$frequencies gtfs$frequencies <- NULL smaller_gtfs <- filter_by_time_of_day(gtfs, from = "05:00:00", to = "06:00:00") head(smaller_gtfs$stop_times[, c("trip_id", "departure_time", "arrival_time")]) ``` When filtering the `stop_times` table, we have two options. We either keep entire trips that cross the specified time block, or we keep only the trip segments within this block (default behaviour). To control this behaviour you can use the `full_trips` parameter: ```{r} smaller_gtfs <- filter_by_time_of_day( gtfs, "05:00:00", "06:00:00", full_trips = TRUE ) # CPTM L07-0 trip is kept intact because it crosses the time block head(smaller_gtfs$stop_times[, c("trip_id", "departure_time", "arrival_time")]) # dropping entries related to trips that cross the specified time block smaller_gtfs <- filter_by_time_of_day( gtfs, "05:00:00", "06:00:00", full_trips = TRUE, keep = FALSE ) # CPTM L07-0 trip is gone head(smaller_gtfs$stop_times[, c("trip_id", "departure_time", "arrival_time")]) ``` `filter_by_time_of_day()` also includes a `update_frequencies` argument, used to control whether the `frequencies` table should have its `start_time` and `end_time` fields updated to fit inside/outside the specified time of day. Please read the function documentation to understand how this argument interacts with the `exact_times` field. # Filtering using a spatial extent It's not uncommon that one wants to analyze only the transit services of a smaller region contained inside a feed. The `filter_by_sf()` function allows you to filter GTFS data using a given spatial extent. This functions takes a spatial `sf`/`sfc` object (or its bounding box) and keeps/drops the entries related to shapes and trips selected via a specified spatial operation. It may sound a bit complicated, but it's fairly easy to understand when shown. Let's create an auxiliary function to save us some typing: ```{r} plotter <- function(gtfs, geom, spatial_operation = sf::st_intersects, keep = TRUE, do_filter = TRUE) { if (do_filter) { gtfs <- filter_by_sf(gtfs, geom, spatial_operation, keep) } shapes <- convert_shapes_to_sf(gtfs) trips <- get_trip_geometry(gtfs, file = "stop_times") geom <- sf::st_as_sfc(geom) ggplot() + geom_sf(data = trips) + geom_sf(data = shapes) + geom_sf(data = geom, fill = NA) } ``` This function: - Conditionally filters a GTFS object given a spatial object (called `geom`); - Generates shapes' and trips' geometries as described in their respective tables; - Generates a polygon from the bounding box; - Plots all the `sf` objects cited above to show the effect of each `filter_by_sf()` argument in the final result. Also, please note that our `plotter()` function takes the same arguments of `filter_by_sf()` (with the exception of `do_filter`, which is used to show the unfiltered data), as well as the same defaults. Let's say that we want to filter GTFS data using the bounding box of the shape `68962`. Here's how the unfiltered data looks like, with the bounding box placed on top of it. ```{r, eval = requireNamespace("ggplot2", quietly = TRUE)} bbox <- sf::st_bbox(convert_shapes_to_sf(gtfs, shape_id = "68962")) plotter(gtfs, bbox, do_filter = FALSE) ``` By default `filter_by_sf()` (and `plotter()`, consequently) keeps all the data related to the trips and shapes that intersect the given geometry. Here's how it looks like: ```{r, eval = requireNamespace("ggplot2", quietly = TRUE)} plotter(gtfs, bbox) ``` Alternatively you can also *drop* such data: ```{r, eval = requireNamespace("ggplot2", quietly = TRUE)} plotter(gtfs, bbox, keep = FALSE) ``` You can also control which spatial operation you want to use to filter the data. This is how you'd keep the data that is contained inside the given geometry: ```{r, eval = requireNamespace("ggplot2", quietly = TRUE)} plotter(gtfs, bbox, spatial_operation = sf::st_contains) ``` And, simultaneously using `spatial_operation` and `keep`, this is how you'd drop the data contained inside the geometry: ```{r, eval = requireNamespace("ggplot2", quietly = TRUE)} plotter(gtfs, bbox, spatial_operation = sf::st_contains, keep = FALSE) ``` All filtering functions return a GTFS object readily available to be manipulated and analyzed using the rest of **gtfstools**' toolkit. For more information on how to use other functions made available by the package, please see the [introductory vignette](gtfstools.html).