In this vignette, we introduce how primitive Presto data types are
translated into R types in the RPresto
package.
You can check your RPresto
version by running the
packageVersion()
function. You need version 1.3.9
or later to have a more comprehensive and robust primitive types
support.
Here primitive types refer to basic data types that are
atomic and non-nested (as opposed to structural types such as
ARRAY
, MAP
, and ROW
). Refer to
the Presto
documentation for a complete list of Presto data types.
We don’t currently support all of Presto’s primitive types. Here we summarize what’s supported and what those supported types map to in R.
Category | Presto data type | R type |
---|---|---|
Boolean | BOOLEAN | logical |
Integer | TINYINT | integer |
Integer | SMALLINT | integer |
Integer | INTEGER | integer |
Integer | BIGINT | {integer|integer64|numeric|character} |
Floating-point | REAL | numeric |
Floating-point | DOUBLE | numeric |
Fixed-precision | DECIMAL | character |
String | VARCHAR | character |
String | CHAR | character |
String | VARBINARY | raw |
String | JSON | not supported |
Date and Time | DATE | Date (S3 class) |
Date and Time | TIME | difftime (S3 class) |
Date and Time | TIME WITH TIME ZONE | difftime (S3 class) |
Date and Time | TIMESTAMP | POSIXct (S3 class) |
Date and Time | TIMESTAMP WITH TIME ZONE | POSIXct (S3 class) |
Date and Time | INTERVAL YEAR TO MONTH | Duration (S4 class) |
Date and Time | INTERVAL DAY TO SECOND | Duration (S4 class) |
We assume that the user already have a Presto server with a memory connector set up. If you don’t have such a server set up, refer to the Presto documentation for instructions if you want to follow along.
We first create a PrestoConnection
which will serve as
the bridge between the Presto server and R.
con <- DBI::dbConnect(
drv = RPresto::Presto(),
host = "http://localhost",
port = 8080,
user = Sys.getenv("USER"),
catalog = "memory",
schema = "default"
)
We first issue a simple query to see if the Presto connection is working properly.
To illustrate how those primitive types are mapped to R types in bulk, we first create a made-up table with 3 rows and 17 fields (i.e., one column for each supported Presto type).
We create the table using an auxiliary
create_primitive_types_table()
function included in the
package.
RPresto:::create_primitive_types_table(
con, table_name = "presto_primitive_types", verbose = FALSE
)
We can check if the table now exists in Presto.
We can list the fields in the table. They are named after the Presto types they represent.
DBI::dbListFields(con, "presto_primitive_types")
#> [1] "boolean" "tinyint" "smallint"
#> [4] "integer" "bigint" "real"
#> [7] "double" "decimal" "varchar"
#> [10] "char" "varbinary" "date"
#> [13] "time" "time_with_tz" "timestamp"
#> [16] "timestamp_with_tz" "interval_year_to_month" "interval_day_to_second"
Translating boolean values from Presto to R is fairly
straightforward. true
and false
values are
mapped to TRUE
and FALSE
in R and
null
is mapped to NA
which is by default a
logical
(i.e., boolean) type in R.
(
df.boolean <- dbGetQuery(con, "SELECT boolean FROM presto_primitive_types")
)
#> # A tibble: 3 × 1
#> boolean
#> <lgl>
#> 1 TRUE
#> 2 FALSE
#> 3 NA
We can verify that all three values in R are
logical
.
Presto has 4 integer data types.
TINYINT
is 8-bit and ranges from -2^7
to
2^7-1
.
SMALLINT
is 16-bit and ranges from -2^15
to
2^15-1
.
INTEGER
is 32-bit and ranges from -2^31
to
2^31-1
.
BIGINT
is 64-bit and ranges from -2^63
to
2^63-1
.
Presto data type | Bits | Minimum value | Maximum value |
---|---|---|---|
TINYINT | 8 | -128 | 127 |
SMALLINT | 16 | -32,768 | 32,767 |
INTEGER | 32 | -2,147,483,648 | 2,147,483,647 |
BIGINT | 64 | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 |
In comparison, base R only ships with one 32-bit integer type, but
the range is slightly different from Presto. R’s integer type ranges
from -2^31+1 = -2,147,483,647
to
2^31-1 = 2,147,483,647
. The one number (out of ~4.3 billion
32-bit integer numbers) that’s a valid INTEGER
in Presto
but not in R is the range lower bound -2,147,483,648
.
Since all TINYINT
, SMALLINT
, and
INTEGER
numbers (except -2,147,483,648
) can be
fit into R’s integer
type, we translate all of them to
integer
s.
(
df.non_bigint_int <- dbGetQuery(
con, "SELECT tinyint, smallint, integer FROM presto_primitive_types"
)
)
#> # A tibble: 3 × 3
#> tinyint smallint integer
#> <int> <int> <int>
#> 1 -128 -32768 -2147483647
#> 2 127 32767 2147483647
#> 3 NA NA NA
We can verify that all three columns in R are integer
types.
purrr::map_chr(df.non_bigint_int, class)
#> tinyint smallint integer
#> "integer" "integer" "integer"
We can also show that mapping the one exception
-2,147,483,648
to R results in an error while the query
runs correctly in Presto. Users whose data contains the
exceptional integer should consider using the BIGINT
type
rather than INTEGER
to encode the data in
Presto.
Since the Presto BIGINT
type is 64-bit, its range is
clearly much wider than base R’s integer
type.
However, by default we still map BIGINT
integers from
Presto to integer
in R, for two reasons.
R doesn’t have native 64-bit integer support yet. Mapping
BIGINT
to a 64-bit integer in R will require additional
package and add extra dependencies.
The 32-bit integer range covers 2+ billion positive integers and 4.3 billion integers in total. Most integer values used in data analysis and modeling can be well covered in that range.
Besides the default mapping to integer
(albeit 32-bit)
type, we offer three other options when it comes BIGINT
mapping.
Firstly, using the bigint = "character"
argument
instructs RPresto
to cast the BIGINT
value to
a character
type. This is particularly useful when
BIGINT
is used to store long IDs rather than large numbers
(i.e., the numbers are not used in arithmetic computations).
Using bigint = "integer64"
makes
RPresto
translate BIGINT
values to a value of
the integer64
S3 class from the bit64
package.
bigint = "numeric"
makes RPresto
store
the BIGINT
value in a numeric
( i.e.,
double
) type.
Whenever we map an integer to another numeric type, we always need to
consider the precision of the mapping, that is, whether there’s any
precision loss during the translation. Among the three translation
options above, the first one (i.e., casting BIGINT
to
character
) doesn’t involve any precision translation, so we
will focus the precision discussion on the other two translations (
bit64::integer64
and numeric
).
On the receiving end, the bit64::integer64
type has a
range from -2^63+1 = -9,223,372,036,854,775,807
to
2^63-1 = 9,223,372,036,854,775,807
.
Comparing the range to Presto’s BIGINT
range, it seems
that in theory the bit64::integer64
range is only
infinitesimally smaller than the BIGINT
range by 1 number
(again, the lower bound number). However, in practice, the range of
BIGINT
values that can be translated into
bit64::integer64
without precision loss is
much smaller.
The limitation comes from how the Presto
REST API communicates data with R. It uses the JSON format to encode
the query result data and sends it to R for RPresto
to
process.
JSON by default encodes integers as double
numbers, so
its precision
is limited to +/-(2^53-1) = +/-9,007,199,254,740,991
.
Any integers outside of this range will lose precision during the
translation.
Since the translation limitation is caused by the JSON format
encoding integers using double
, the same limitation applies
when BIGINT
values are mapped to numeric
types
in R.
bigint = | R type | Range without precision loss | Range with possible precision loss |
---|---|---|---|
character | character | NA | NA |
integer64 | bit64::integer64 | +/-9,007,199,254,740,991 | +/- 9,223,372,036,854,775,807 |
numeric | numeric | +/-9,007,199,254,740,991 | [-9,223,372,036,854,775,808, 9,223,372,036,854,775,807] |
The table created by the create_primitive_types_table()
function has the values from the no-precision-lost range. Below we show
how different bigint
arguments change the output R
types.
bigint = "character"
dbGetQuery(
con, "SELECT bigint FROM presto_primitive_types", bigint = "character"
)
#> # A tibble: 3 × 1
#> bigint
#> <chr>
#> 1 -9007199254740991
#> 2 9007199254740991
#> 3 <NA>
bigint = "integer64"
dbGetQuery(
con, "SELECT bigint FROM presto_primitive_types", bigint = "integer64"
)
#> # A tibble: 3 × 1
#> bigint
#> <int64>
#> 1 -9007199254740991
#> 2 9007199254740991
#> 3 NA
bigint = "numeric"
dbGetQuery(
con, "SELECT bigint FROM presto_primitive_types", bigint = "numeric"
)
#> # A tibble: 3 × 1
#> bigint
#> <dbl>
#> 1 -9007199254740991
#> 2 9007199254740991
#> 3 NA
If you attempt to translate integers outside of the no-precision-loss range, a warning message will show up.
dbGetQuery(
con,
"
SELECT SIGN(bigint) * (ABS(bigint) + 1) AS bigint_precision_loss
FROM presto_primitive_types
",
bigint = "numeric"
)
#> Warning in as.double.integer64(x): integer precision lost while converting to
#> double
#> # A tibble: 3 × 1
#> bigint_precision_loss
#> <dbl>
#> 1 -9007199254740992
#> 2 9007199254740992
#> 3 NA
R only has one floating point data type, double
(its
class is numeric
). All floating numbers are stored in
double precision format (i.e., 64-bit). This matches Presto’s DOUBLE
type, so translation between Presto and R is straightforward.
(
df.floating_point <- dbGetQuery(
con,
"SELECT real, double FROM presto_primitive_types"
)
)
#> # A tibble: 3 × 2
#> real double
#> <dbl> <dbl>
#> 1 1 1
#> 2 2 2
#> 3 NA NA
We can verify that both floating point types are translated to
numeric
.
Presto offers a DECIMAL
data type that has fixed
precision up to 38 digits. This means that it can be used to represent a
very large integer which is obviously outside of the precision limits
mentioned above. We can’t decide one precise R class to translate the
DECIMAL
data type into, so RPresto currently translates the
type into a string (i.e., character
) in R.
VARCHAR
and CHAR
typesVARCHAR
and CHAR
data types in Presto are mapped to R’s character
type.
(
df.characters <- dbGetQuery(
con,
"SELECT varchar, char FROM presto_primitive_types"
)
)
#> # A tibble: 3 × 2
#> varchar char
#> <chr> <chr>
#> 1 abc a
#> 2 def b
#> 3 <NA> <NA>
We can verify the resulting R types to be character
.
VARBINARY
Presto’s VARBINARY
type stores string data in raw bytes. It can be nicely mapped to R’s
raw
type.
(
df.bytes <- dbGetQuery(
con,
"SELECT varbinary FROM presto_primitive_types"
)
)
#> # A tibble: 3 × 1
#> varbinary
#> <list>
#> 1 <raw [3]>
#> 2 <raw [3]>
#> 3 <raw [0]>
We can verify all elements in the resulting column are of
raw
data type.
We can also convert the bytes data back to the string values.
In R, there are three types of date/time data that refer to an instant in time.
A date. Tibbles print this as
<date>
.
A time within a day. Tibbles print this as
<time>
.
A date-time is a date plus a time: it uniquely
identifies an instant in time (typically to the nearest second). Tibbles
print this as <dttm>
. Elsewhere in R these are called
POSIXct
.
We prefer to use the lubridate
package to handle date
and date-time objects. Time objects are not very commonly used and R
lacks a strong native support for it. In RPresto
, we uses
the hms
package to handle time objects.
DATE
The DATE
type is by far the most used date-and-time types. Note that a
DATE
value isn’t tied to any particular time zone, so it
isn’t associated with a unique point in time (i.e., seconds or
microseconds since epoch).
We use base R’s Date
S3 class to translate Presto’s
DATE
type.
(
df.date <- dbGetQuery(
con,
"SELECT date FROM presto_primitive_types"
)
)
#> # A tibble: 3 × 1
#> date
#> <date>
#> 1 2000-01-01
#> 2 2000-01-02
#> 3 NA
We can verify the R type of the column.
TIMESTAMP
The POSIXct
type values, on the other hand, are
associated with a unique point in time. That is, they can be translated
to a unique numeric value that refers to the number of time units
(usually seconds, milliseconds, or microseconds) elapsed since epoch
(i.e., the beginning of time). This is why the mode of a
POSIXct
value in R is numeric and you can call the
as.integer()
function on it.
foo <- lubridate::ymd_hms("2000-01-01 01:02:03", tz = "America/New_York")
mode(foo)
#> [1] "numeric"
as.integer(foo)
#> [1] 946706523
In this case, the integer value refers to the number of seconds elapsed since 1970-01-01 00:00:00 in the UTC time zone.
Presto’s TIMESTAMP
and TIMESTAMP WITH TIME ZONE
types follow the same logic and thus can be mapped to the
POSIXct
type. To ensure consistency, RPresto
always translates the timestamp to match the Presto session’s
timezone.
You can check the session time zone by printing the
session.timezone
slot of the PrestoConnection
object.
Here we get the TIMESTAMP
and
TIMESTAMP WITH TIME ZONE
values from the table.
(
df.posixct <- dbGetQuery(
con,
"SELECT timestamp, timestamp_with_tz FROM presto_primitive_types"
)
)
#> # A tibble: 3 × 2
#> timestamp timestamp_with_tz
#> <dttm> <dttm>
#> 1 2000-01-01 01:02:03 2000-01-01 06:02:03
#> 2 2000-01-02 02:03:04 2000-01-02 07:03:04
#> 3 NA NA
We can check the R types of the two columns.
purrr::map(df.posixct, class)
#> $timestamp
#> [1] "POSIXct" "POSIXt"
#>
#> $timestamp_with_tz
#> [1] "POSIXct" "POSIXt"
We can also verify that the time zone attributes match the session time zone.
INTERVAL
sPresto has two INTERVAL
types: INTERVAL YEAR TO MONTH
and INTERVAL DAY TO SECOND
.
We map both of them to lubridate::Duration-class
.
(
df.duration <- dbGetQuery(
con,
"
SELECT
interval_year_to_month,
interval_day_to_second
FROM presto_primitive_types
"
)
)
#> # A tibble: 3 × 2
#> interval_year_to_month interval_day_to_second
#> <Duration> <Duration>
#> 1 36817200s (~1.17 years) 187506.5s (~2.17 days)
#> 2 73634400s (~2.33 years) 284889.6s (~3.3 days)
#> 3 NA NA