--- title: "Uniform interface to three regex engines" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Uniform interface to three regex engines} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Several C libraries providing regular expression engines are available in R. The standard R distribution has included the Perl-Compatible Regular Expressions (PCRE) C library since 2002. CRAN package re2r provides the RE2 library, and stringi provides the ICU library. Each of these regex engines has a unique feature set, and may be preferred for different applications. For example, PCRE is installed by default, RE2 guarantees matching in polynomial time, and ICU provides strong unicode support. For a more detailed comparison of the relative strengths of each regex library, we refer the reader to our previous research paper, [Comparing namedCapture with other R packages for regular expressions](https://journal.r-project.org/archive/2019/RJ-2019-050/index.html). Each regex engine has a different R interface, so switching from one engine to another may require non-trivial modifications of user code. In order to make switching between engines easier, the namedCapture package provides a uniform interface for capturing text using PCRE and RE2. The user may specify the desired engine via an option; the namedCapture package provides the output in a uniform format. However namedCapture requires the engine to support specifying capture group names in regex pattern strings, and to support output of the group names to R (which ICU does not support). Our proposed nc package provides support for the ICU engine in addition to PCRE and RE2. The nc package implements this functionality using un-named capture groups, which are supported in all three regex engines. In particular, a regular expression is constructed in R code that uses named arguments to indicate capturing sub-patterns, which are translated to un-named groups when passed to the regex engine. For example, consider a user who wants to capture the two pieces of the column names of the iris data, e.g., `Sepal.Length`. The user would typically specify the capturing regular expression as a string literal, e.g., `"(.*)[.](.*)"`. Using nc the same pattern can be applied to the iris data column names via ```{r} nc::capture_first_vec( names(iris), part = ".*", "[.]", dim = ".*", engine = "ICU", nomatch.error = FALSE) ``` Above we see an example usage of `nc:capture_first_vec`, which is for capturing the first match of a regex from each element of a character vector subject (the first argument). There are a variable number of other arguments (`...`) which are used to define the regex pattern. In this case there are three pattern arguments: `part = ".*", "[.]", dim = ".*"`. Each named R argument in the pattern generates an un-named capture group by enclosing the specified character string in parentheses, e.g., `(.*)` for both `part` and `dim` arguments above. All of the sub-patterns are pasted together in the sequence they appear in order to create the final pattern that is used with the specified regex engine. The `nomatch.error = FALSE` argument is given because the default is to stop with an error if any subjects do not match the specified pattern (the fifth subject `Species` does not match). Under the hood, the following function is called to parse the pattern arguments: ```{r} str(compiled <- nc::var_args_list(part = ".*", "[.]", dim = ".*")) ``` This function is intended mostly for internal use, but can be useful for viewing the generated regex pattern (or using it as input to another regex function). The return value is a named list of two elements: `pattern` is the capturing regular expression which is generated based on the input arguments, and `fun.list` is a named list of type conversion functions. If the user does not specify a type conversion function for a group (as in the example code above), then the default is `base::identity`, which simply returns the captured character strings. Group-specific type conversion functions are useful for converting captured text into numeric output columns. Note that the order of elements in `fun.list` corresponds to the order of capture groups in the pattern (e.g., first capture group named `part`, second `dim`). These data can be used with any regex engine that supports un-named capture groups (including ICU) in order to get a capture matrix with column names, e.g. ```{r} m <- stringi::stri_match_first_regex(names(iris), compiled$pattern) colnames(m) <- c("match", names(compiled$fun.list)) m ``` Again, this is not the recommended usage of nc, but here we give these details in order to explain how it works. Note that the result from stringi is a character matrix with three columns: first for the entire match, and another column for each capture group. Using the same pattern with `base::regexpr` (PCRE engine) or `re2r::re2_match` (RE2 engine) yields output in varying formats. The nc package takes care of converting these different results into a standard data table format which makes it easy to switch regex engines (by changing the value of the `engine` argument). Most of the time the different engines give similar results, but in some cases there are differences: ```{r, error=TRUE, purl=FALSE} u.subject <- "a\U0001F60E#" u.pattern <- list( emoji="\\p{EMOJI_Presentation}")#only supported in ICU. old.opt <- options(nc.engine="ICU") nc::capture_first_vec(u.subject, u.pattern) nc::capture_first_vec(u.subject, u.pattern, engine="PCRE") nc::capture_first_vec(u.subject, u.pattern, engine="RE2") options(old.opt) ``` Note that the standard output format used by nc, as shown above with `nc::capture_first_vec`, is a data table (not a character matrix, as in other regex packages). The main reason that data tables are always output by nc is in order to support output columns of different types, when type conversion functions are specified.