rqdatatable
is an implementation of the rquery
piped
Codd-style relational algebra hosted on data.table
.
rquery
allow the expression of complex transformations as a
series of relational operators and rqdatatable
implements
the operators using data.table
.
A Python
version of
rquery
/rqdatatable
is under initial
development as data_algebra
.
For example scoring a logistic regression model (which requires grouping, ordering, and ranking) is organized as follows. For more on this example please see “Let’s Have Some Sympathy For The Part-time R User”.
library("rqdatatable")
## Loading required package: wrapr
## Loading required package: rquery
# data example
<- build_frame(
dL "subjectID", "surveyCategory" , "assessmentTotal" |
1 , "withdrawal behavior", 5 |
1 , "positive re-framing", 2 |
2 , "withdrawal behavior", 3 |
2 , "positive re-framing", 4 )
<- 0.237
scale
# example rquery pipeline
<- local_td(dL) %.>%
rquery_pipeline extend_nse(.,
probability :=
exp(assessmentTotal * scale)) %.>%
normalize_cols(.,
"probability",
partitionby = 'subjectID') %.>%
pick_top_k(.,
k = 1,
partitionby = 'subjectID',
orderby = c('probability', 'surveyCategory'),
reverse = c('probability', 'surveyCategory')) %.>%
rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
select_columns(., c('subjectID',
'diagnosis',
'probability')) %.>%
orderby(., cols = 'subjectID')
We can show the expanded form of query tree.
cat(format(rquery_pipeline))
mk_td("dL", c(
"subjectID",
"surveyCategory",
"assessmentTotal")) %.>%
extend(.,
probability := exp(assessmentTotal * 0.237)) %.>%
extend(.,
probability := probability / sum(probability),
partitionby = c('subjectID'),
orderby = c(),
reverse = c()) %.>%
extend(.,
row_number := row_number(),
partitionby = c('subjectID'),
orderby = c('probability', 'surveyCategory'),
reverse = c('probability', 'surveyCategory')) %.>%
select_rows(.,
row_number <= 1) %.>%
rename_columns(.,
c('diagnosis' = 'surveyCategory')) %.>%
select_columns(.,
c('subjectID', 'diagnosis', 'probability')) %.>%
order_rows(.,
c('subjectID'),
reverse = c(),
limit = NULL)
And execute it using data.table
.
ex_data_table(rquery_pipeline)
## subjectID diagnosis probability
## 1 1 withdrawal behavior 0.6706221
## 2 2 positive re-framing 0.5589742
One can also apply the pipeline to new tables.
build_frame(
"subjectID", "surveyCategory" , "assessmentTotal" |
7 , "withdrawal behavior", 5 |
7 , "positive re-framing", 20 ) %.>%
rquery_pipeline
## subjectID diagnosis probability
## 1 7 positive re-framing 0.9722128
Initial bench-marking of rqdatatable
is very favorable
(notes here).
To install rqdatatable
please use
install.packages("rqdatatable")
.
Some related work includes:
data.table
Polars
data algebra
disk.frame
dbplyr
dplyr
dtplyr
maditr
nc
poorman
rquery
SparkR
sparklyr
sqldf
table.express
tidyfast
tidyfst
tidyquery
tidyr
tidytable
(formerly gdt
/tidydt
)–
Note rqdatatable
has an “immediate mode” which allows
direct application of pipelines stages without pre-assembling the
pipeline. “Immediate mode” is a convenience for ad-hoc analyses, and has
some negative performance impact, so we encourage users to build
pipelines for most work. Some notes on the issue can be found here.
rqdatatable
implements the rquery
grammar
in the style of a “Turing or Cook reduction” (implementing the result in
terms of multiple oracle calls to the related system).
rqdatatable
is intended for “simple column names”, in
particular as rqdatatable
often uses eval()
to
work over data.table
escape characters such as
“\
” and “\\
” are not reliable in column names.
Also rqdatatable
does not support tables with no
columns.