One of the design goals of the cdata
R
package is that data
occurs in records, and records may be a pattern of cells in a groups of
rows.
The allows cdata
to support very powerful and arbitrary
record transforms in one or two steps. Using “row records” (that is
records that are exactly one row) as an intermediate lets us take just
about any record shape to just about any record shape: first convert to
row-records, then re-block the data into arbitrary record shapes (please
see here
and here
for the concepts).
But as with all general ideas, it is much easier to see what we mean by the above with a concrete example. Let’s consider the following artificial (but simple) example. Suppose we have the following data.
library("cdata")
data <- wrapr::build_frame(
"record_id" , "row" , "col1", "col2", "col3" |
1 , "row1", 1 , 2 , 3 |
1 , "row2", 4 , 5 , 6 |
1 , "row3", 7 , 8 , 9 |
2 , "row1", 11 , 12 , 13 |
2 , "row2", 14 , 15 , 16 |
2 , "row3", 17 , 18 , 19 )
knitr::kable(data)
record_id | row | col1 | col2 | col3 |
---|---|---|---|---|
1 | row1 | 1 | 2 | 3 |
1 | row2 | 4 | 5 | 6 |
1 | row3 | 7 | 8 | 9 |
2 | row1 | 11 | 12 | 13 |
2 | row2 | 14 | 15 | 16 |
2 | row3 | 17 | 18 | 19 |
In the above the records are the triples of rows with matching
record_id
and the different rows within the record are
identified by the value in the row
column. So The data
items are named by the triplet record_id
, row
and renaming column name (col1
, col2
, or
col2
). This sort of naming of values is essentially Codd’s
“guaranteed access rule”.
Suppose we want to transpose each of the records- swapping the row
and column notions. With cdata
this is easy. First you
design a transform to flatten each complex record into a single wide row
(using the design steps taught here).
Essentially that is just specifying the following control variables. We
define how to identify records (the key columns) and the structure of
the records (giving the interior of the record arbitrary names we will
re-use later).
recordKeys = 'record_id'
incoming_shape <- wrapr::qchar_frame(
"row" , "col1", "col2", "col3" |
"row1", v11 , v12 , v13 |
"row2", v21 , v22 , v23 |
"row3", v31 , v32 , v33 )
And we specify (using the same principles) the desired final record shape, re-using the interior names from the first step to show where values are to be mapped.
outgoing_shape <- wrapr::qchar_frame(
"column_label" , "c_row1", "c_row2", "c_row3" |
"rec_col1" , v11 , v21 , v31 |
"rec_col2" , v12 , v22 , v32 |
"rec_col3" , v13 , v23 , v33 )
Once you have done this you specify the overall transform by building a layout specifying the incoming and outgoing record shapes.
layout <- layout_specification(
incoming_shape = incoming_shape,
outgoing_shape = outgoing_shape,
recordKeys = recordKeys)
print(layout)
#> {
#> in_record <- wrapr::qchar_frame(
#> "record_id" , "row" , "col1", "col2", "col3" |
#> . , "row1", v11 , v12 , v13 |
#> . , "row2", v21 , v22 , v23 |
#> . , "row3", v31 , v32 , v33 )
#> in_keys <- c('record_id', 'row')
#>
#> # becomes
#>
#> out_record <- wrapr::qchar_frame(
#> "record_id" , "column_label", "c_row1", "c_row2", "c_row3" |
#> . , "rec_col1" , v11 , v21 , v31 |
#> . , "rec_col2" , v12 , v22 , v32 |
#> . , "rec_col3" , v13 , v23 , v33 )
#> out_keys <- c('record_id', 'column_label')
#>
#> # args: c(checkNames = TRUE, checkKeys = TRUE, strict = FALSE, allow_rqdatatable = FALSE)
#> }
This layout specification or controller can then perform the transform.
record_id | column_label | c_row1 | c_row2 | c_row3 |
---|---|---|---|---|
1 | rec_col1 | 1 | 4 | 7 |
1 | rec_col2 | 2 | 5 | 8 |
1 | rec_col3 | 3 | 6 | 9 |
2 | rec_col1 | 11 | 14 | 17 |
2 | rec_col2 | 12 | 15 | 18 |
2 | rec_col3 | 13 | 16 | 19 |
And the transform is done, each record has been transposed.
The principle is “draw a picture.” First we draw a picture of the block record structure we have, and then we draw a picture of the block record structure we want.
As you have seen, we have complete freedom to re-name columns and record-piece labels (the labels that tell us which portion of a block-record each row fits into).
If you don’t want to use pipe notation, you can use the method
layout_by()
(which takes a layout specification as an
argument) or the method convert_records()
(which takes the
components of the transform specification as separate arguments).
record_id | column_label | c_row1 | c_row2 | c_row3 |
---|---|---|---|---|
1 | rec_col1 | 1 | 4 | 7 |
1 | rec_col2 | 2 | 5 | 8 |
1 | rec_col3 | 3 | 6 | 9 |
2 | rec_col1 | 11 | 14 | 17 |
2 | rec_col2 | 12 | 15 | 18 |
2 | rec_col3 | 13 | 16 | 19 |
cr <- convert_records(
data,
keyColumns = recordKeys,
incoming_shape = incoming_shape,
outgoing_shape = outgoing_shape)
knitr::kable(cr)
record_id | column_label | c_row1 | c_row2 | c_row3 |
---|---|---|---|---|
1 | rec_col1 | 1 | 4 | 7 |
1 | rec_col2 | 2 | 5 | 8 |
1 | rec_col3 | 3 | 6 | 9 |
2 | rec_col1 | 11 | 14 | 17 |
2 | rec_col2 | 12 | 15 | 18 |
2 | rec_col3 | 13 | 16 | 19 |
A nifty bonus is: if the transformation is “faithful” (preserves
enough cells and labels), then it is invertible and in fact easy to
invert (by the t()
transpose/adjoint function).
inv_layout <- t(layout)
print(inv_layout)
#> {
#> in_record <- wrapr::qchar_frame(
#> "record_id" , "column_label", "c_row1", "c_row2", "c_row3" |
#> . , "rec_col1" , v11 , v21 , v31 |
#> . , "rec_col2" , v12 , v22 , v32 |
#> . , "rec_col3" , v13 , v23 , v33 )
#> in_keys <- c('record_id', 'column_label')
#>
#> # becomes
#>
#> out_record <- wrapr::qchar_frame(
#> "record_id" , "row" , "col1", "col2", "col3" |
#> . , "row1", v11 , v12 , v13 |
#> . , "row2", v21 , v22 , v23 |
#> . , "row3", v31 , v32 , v33 )
#> out_keys <- c('record_id', 'row')
#>
#> # args: c(checkNames = TRUE, checkKeys = TRUE, strict = FALSE, allow_rqdatatable = FALSE)
#> }
data %.>%
layout %.>%
inv_layout %.>%
knitr::kable(.)
record_id | row | col1 | col2 | col3 |
---|---|---|---|---|
1 | row1 | 1 | 2 | 3 |
1 | row2 | 4 | 5 | 6 |
1 | row3 | 7 | 8 | 9 |
2 | row1 | 11 | 12 | 13 |
2 | row2 | 14 | 15 | 16 |
2 | row3 | 17 | 18 | 19 |
Also these conversions can also be translated into rquery
operators, and therefore saved to be run either in memory or directly on
a database.
table_desciption <- rquery::local_td(data)
ops <- table_desciption %.>%
layout
cat(format(ops))
#> mk_td("data", c(
#> "record_id",
#> "row",
#> "col1",
#> "col2",
#> "col3")) %.>%
#> non_sql_node(., CREATE TEMPORARY TABLE "OUT" AS SELECT a."record_id" "record_id", MAX( CASE WHEN CAST(a."row" AS VARCHAR) = 'row1' THEN a."col1" ELSE NULL END ) "v11", MAX( CASE WHEN CAST(a."row" AS VARCHAR) = 'row1' THEN a."col2" ELSE NULL END ) "v12", MAX( CASE WHEN CAST(a."row" AS VARCHAR) = 'row1' THEN a."col3" ELSE NULL END ) "v13", MAX( CASE WHEN CAST(a."row" AS VARCHAR) = 'row2' THEN a."col1" ELSE NULL END ) "v21", MAX( CASE WHEN CAST(a."row" AS VARCHAR) = 'row2' THEN a."col2" ELSE NULL END ) "v22", MAX( CASE WHEN CAST(a."row" AS VARCHAR) = 'row2' THEN a."col3" ELSE NULL END ) "v23", MAX( CASE WHEN CAST(a."row" AS VARCHAR) = 'row3' THEN a."col1" ELSE NULL END ) "v31", MAX( CASE WHEN CAST(a."row" AS VARCHAR) = 'row3' THEN a."col2" ELSE NULL END ) "v32", MAX( CASE WHEN CAST(a."row" AS VARCHAR) = 'row3' THEN a."col3" ELSE NULL END ) "v33" FROM "IN" a GROUP BY a."record_id") %.>%
#> non_sql_node(., CREATE TEMPORARY TABLE "OUT" AS SELECT a."record_id", b."column_label", CASE WHEN CAST(b."column_label" AS VARCHAR) = 'rec_col1' THEN a."v11" WHEN CAST(b."column_label" AS VARCHAR) = 'rec_col2' THEN a."v12" WHEN CAST(b."column_label" AS VARCHAR) = 'rec_col3' THEN a."v13" ELSE NULL END AS "c_row1", CASE WHEN CAST(b."column_label" AS VARCHAR) = 'rec_col1' THEN a."v21" WHEN CAST(b."column_label" AS VARCHAR) = 'rec_col2' THEN a."v22" WHEN CAST(b."column_label" AS VARCHAR) = 'rec_col3' THEN a."v23" ELSE NULL END AS "c_row2", CASE WHEN CAST(b."column_label" AS VARCHAR) = 'rec_col1' THEN a."v31" WHEN CAST(b."column_label" AS VARCHAR) = 'rec_col2' THEN a."v32" WHEN CAST(b."column_label" AS VARCHAR) = 'rec_col3' THEN a."v33" ELSE NULL END AS "c_row3" FROM "IN" a CROSS JOIN "rrtbl_30003829942041621262_0000000002" b )
rquery::column_names(ops)
#> [1] "record_id" "column_label" "c_row1" "c_row2" "c_row3"
if(requireNamespace("DBI", quietly = TRUE) &&
requireNamespace("RSQLite", quietly = TRUE)) {
raw_connection <- DBI::dbConnect(RSQLite::SQLite(),
":memory:")
RSQLite::initExtension(raw_connection)
db <- rquery::rquery_db_info(
connection = raw_connection,
is_dbi = TRUE,
connection_options = rquery::rq_connection_tests(raw_connection))
db_td <- rquery::rq_copy_to(db, "data", data)
ops %.>%
db %.>%
knitr::kable(.) %.>%
print(.)
DBI::dbDisconnect(raw_connection)
}
#>
#>
#> | record_id|column_label | c_row1| c_row2| c_row3|
#> |---------:|:------------|------:|------:|------:|
#> | 1|rec_col1 | 1| 4| 7|
#> | 1|rec_col2 | 2| 5| 8|
#> | 1|rec_col3 | 3| 6| 9|
#> | 2|rec_col1 | 11| 14| 17|
#> | 2|rec_col2 | 12| 15| 18|
#> | 2|rec_col3 | 13| 16| 19|
And that is some of the generality of cdata
transforms.