The nc::capture_all_str
function is for the common case
of extracting each match from a multi-line text file (a single large
subject string). In this section we demonstrate how to extract data
tables from such loosely structured text data. For example we consider
the following track
hub meta-data file:
trackDb.txt.gz <- system.file(
"extdata", "trackDb.txt.gz", package="nc")
trackDb.vec <- readLines(trackDb.txt.gz)
Some representative lines from that file are shown below.
cat(trackDb.vec[78:107], sep="\n")
#> track peaks_summary
#> type bigBed 5
#> shortLabel _model_peaks_summary
#> longLabel Regions with a peak in at least one sample
#> visibility pack
#> itemRgb off
#> spectrum on
#> bigDataUrl http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed
#>
#>
#> track bcell_McGill0091
#> parent bcell
#> container multiWig
#> type bigWig
#> shortLabel bcell_McGill0091
#> longLabel bcell | McGill0091
#> graphType points
#> aggregate transparentOverlay
#> showSubtrackColorOnUi on
#> maxHeightPixels 25:12:8
#> visibility full
#> autoScale on
#>
#> track bcell_McGill0091Coverage
#> bigDataUrl http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig
#> shortLabel bcell_McGill0091Coverage
#> longLabel bcell | McGill0091 | Coverage
#> parent bcell_McGill0091
#> type bigWig
#> color 141,211,199
Each block of text begins with “track” and includes several lines of data before the block ends with two consecutive newlines. That pattern is coded below using a regex:
tracks.dt <- nc::capture_all_str(
trackDb.vec,
"track ",
track="\\S+",
fields="(?:\n[^\n]+)*",
"\n")
str(tracks.dt)
#> Classes 'data.table' and 'data.frame': 123 obs. of 2 variables:
#> $ track : chr "bcell" "kidneyCancer" "kidney" "leukemiaCD19CD10BCells" ...
#> $ fields: chr "\nsuperTrack on show\nshortLabel bcell\nlongLabel bcell ChIP-seq samples" "\nsuperTrack on show\nshortLabel kidneyCancer\nlongLabel kidneyCancer ChIP-seq samples" "\nsuperTrack on show\nshortLabel kidney\nlongLabel kidney ChIP-seq samples" "\nsuperTrack on show\nshortLabel leukemiaCD19CD10BCells\nlongLabel leukemiaCD19CD10BCells ChIP-seq samples" ...
#> - attr(*, ".internal.selfref")=<externalptr>
The result is a data.table with one row for each track block that
matches the regex. There are two character columns: track
is a unique name, and fields
is a string with the rest of
the data in that block:
tracks.dt[, .(track, fields.start=substr(fields, 1, 30))]
#> track fields.start
#> <char> <char>
#> 1: bcell \nsuperTrack on show\nshortLabel
#> 2: kidneyCancer \nsuperTrack on show\nshortLabel
#> 3: kidney \nsuperTrack on show\nshortLabel
#> 4: leukemiaCD19CD10BCells \nsuperTrack on show\nshortLabel
#> 5: monocyte \nsuperTrack on show\nshortLabel
#> ---
#> 119: tcell_McGill0106Coverage \n bigDataUrl http://hubs.hpc.
#> 120: tcell_McGill0106Peaks \n bigDataUrl http://hubs.hpc.
#> 121: tcell_McGill0107 \n parent tcell\n container mult
#> 122: tcell_McGill0107Coverage \n bigDataUrl http://hubs.hpc.
#> 123: tcell_McGill0107Peaks \n bigDataUrl http://hubs.hpc.
Each block has a variable number of lines/fields. Each line starts with a field name, followed by a space, followed by the field value. That regex is coded below:
(fields.dt <- tracks.dt[, nc::capture_all_str(
fields,
"\\s+",
variable=".*?",
" ",
value="[^\n]+"),
by=track])
#> track variable value
#> <char> <char> <char>
#> 1: bcell superTrack on show
#> 2: bcell shortLabel bcell
#> 3: bcell longLabel bcell ChIP-seq samples
#> 4: kidneyCancer superTrack on show
#> 5: kidneyCancer shortLabel kidneyCancer
#> ---
#> 899: tcell_McGill0107Peaks shortLabel tcell_McGill0107Peaks
#> 900: tcell_McGill0107Peaks longLabel tcell | McGill0107 | Peaks
#> 901: tcell_McGill0107Peaks parent tcell_McGill0107
#> 902: tcell_McGill0107Peaks type bigWig
#> 903: tcell_McGill0107Peaks color 0,0,0
str(fields.dt)
#> Classes 'data.table' and 'data.frame': 903 obs. of 3 variables:
#> $ track : chr "bcell" "bcell" "bcell" "kidneyCancer" ...
#> $ variable: chr "superTrack" "shortLabel" "longLabel" "superTrack" ...
#> $ value : chr "on show" "bcell" "bcell ChIP-seq samples" "on show" ...
#> - attr(*, ".internal.selfref")=<externalptr>
Note that because by=track
was specified,
nc::capture_all_str
is called for each unique value of
track
(i.e. each row). The results are combined into a
single data.table with one row for each field. This data.table can be
easily queried, e.g.
fields.dt[
J("tcell_McGill0107Coverage", "bigDataUrl"),
value,
on=.(track, variable)]
#> [1] "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/tcell/McGill0107/coverage.bigWig"
fields.dt[, .(count=.N), by=variable][order(count)]
#> variable count
#> <char> <int>
#> 1: itemRgb 4
#> 2: spectrum 4
#> 3: superTrack 8
#> 4: container 37
#> 5: graphType 37
#> 6: aggregate 37
#> 7: showSubtrackColorOnUi 37
#> 8: maxHeightPixels 37
#> 9: autoScale 37
#> 10: visibility 41
#> 11: color 74
#> 12: bigDataUrl 78
#> 13: parent 111
#> 14: type 115
#> 15: shortLabel 123
#> 16: longLabel 123
For more information about data.table syntax, read
vignette("datatable-intro", package="data.table")
.
In the examples above we extracted all fields from all tracks (using two regexes, one for the track, one for the field). In the example below we extract only the track name, split into separate columns (using a single regex for the track).
cell.sample.type <- list(
cellType="[^ ]*?",
"_",
sampleName=list(
"McGill",
sampleID="[0-9]+", as.integer),
dataType="Coverage|Peaks")
nc::capture_all_str(trackDb.vec, cell.sample.type)
#> cellType sampleName sampleID dataType
#> <char> <char> <int> <char>
#> 1: bcell McGill0091 91 Coverage
#> 2: bcell McGill0091 91 Coverage
#> 3: bcell McGill0091 91 Peaks
#> 4: bcell McGill0091 91 Peaks
#> 5: bcell McGill0322 322 Coverage
#> ---
#> 144: tcell McGill0106 106 Peaks
#> 145: tcell McGill0107 107 Coverage
#> 146: tcell McGill0107 107 Coverage
#> 147: tcell McGill0107 107 Peaks
#> 148: tcell McGill0107 107 Peaks
Note that the pattern above defines nested capture groups via named lists (e.g. sampleID is a subset of sampleName). The pattern below matches either the previously specified track pattern, or any other type of track name:
sample.or.anything <- list(
cell.sample.type,
"|",
"[^\n]+")
track.pattern.old <- list(
"track ",
track=sample.or.anything)
nc::capture_all_str(trackDb.vec, track.pattern.old)
#> track cellType sampleName sampleID dataType
#> <char> <char> <char> <int> <char>
#> 1: bcell NA
#> 2: kidneyCancer NA
#> 3: kidney NA
#> 4: leukemiaCD19CD10BCells NA
#> 5: monocyte NA
#> ---
#> 119: tcell_McGill0106Coverage tcell McGill0106 106 Coverage
#> 120: tcell_McGill0106Peaks tcell McGill0106 106 Peaks
#> 121: tcell_McGill0107 NA
#> 122: tcell_McGill0107Coverage tcell McGill0107 107 Coverage
#> 123: tcell_McGill0107Peaks tcell McGill0107 107 Peaks
Notice the repetition of track
in the pattern above.
This can be avoided by using the nc::field
helper function,
which takes three arguments, that are pasted together to form a
pattern:
field.name
is used as a pattern, and as the capture
group (column) name for the pattern specified in the third
argument.between.pattern
is a pattern that matches between the
other two patterns.field.pattern
is the pattern that matches the text to
be extracted in a capture group.The example above can thus be re-written as below, avoiding the
repetition of track
which was present above:
track.pattern <- nc::field("track", " ", sample.or.anything)
nc::capture_all_str(trackDb.vec, track.pattern)
#> track cellType sampleName sampleID dataType
#> <char> <char> <char> <int> <char>
#> 1: bcell NA
#> 2: kidneyCancer NA
#> 3: kidney NA
#> 4: leukemiaCD19CD10BCells NA
#> 5: monocyte NA
#> ---
#> 119: tcell_McGill0106Coverage tcell McGill0106 106 Coverage
#> 120: tcell_McGill0106Peaks tcell McGill0106 106 Peaks
#> 121: tcell_McGill0107 NA
#> 122: tcell_McGill0107Coverage tcell McGill0107 107 Coverage
#> 123: tcell_McGill0107Peaks tcell McGill0107 107 Peaks
Finally we use field
again to match the type column:
any.lines.pattern <- "(?:\n[^\n]+)*"
nc::capture_all_str(
trackDb.vec,
track.pattern,
any.lines.pattern,
"\\s+",
nc::field("type", " ", "[^\n]+"))
#> track cellType sampleName sampleID dataType type
#> <char> <char> <char> <int> <char> <char>
#> 1: all_labels NA bigBed 9
#> 2: problems NA bigBed 3
#> 3: jointProblems NA bigBed 3
#> 4: peaks_summary NA bigBed 5
#> 5: bcell_McGill0091 NA bigWig
#> ---
#> 111: tcell_McGill0106Coverage tcell McGill0106 106 Coverage bigWig
#> 112: tcell_McGill0106Peaks tcell McGill0106 106 Peaks bigWig
#> 113: tcell_McGill0107 NA bigWig
#> 114: tcell_McGill0107Coverage tcell McGill0107 107 Coverage bigWig
#> 115: tcell_McGill0107Peaks tcell McGill0107 107 Peaks bigWig
Exercise for the reader (easy): modify the above regex in order to
capture the bigDataUrl field, and three additional columns (red, green,
blue) from the color field. Assume that bigDataUrl
occurs
before color
in each track. Note that this is a limitation
of the single regex approach — using two regex, as described in previous
sections, could extract any/all fields, even if they appear in different
orders in different tracks.
Exercise for the reader (hard): note that the last code block only
matches tracks which define the type field. How would you optionally
match the type field? Hint: the current any.lines.pattern
can match the type field.
Thanks to Marc Tollis for providing the example data used in this section (from the SweeD bioinformatics program). Some representative lines from one output file are shown below.
info.txt.gz <- system.file(
"extdata", "SweeD_Info.txt.gz", package="nc")
info.vec <- readLines(info.txt.gz)
info.vec[20:50]
#> [1] " Total number of samples in the VCF:\t13"
#> [2] " Samples excluded from the analysis:\t6"
#> [3] ""
#> [4] ""
#> [5] " Alignment 1"
#> [6] ""
#> [7] "\t\tChromosome:\t\tscaffold_0"
#> [8] "\t\tSequences:\t\t14"
#> [9] "\t\tSites:\t\t\t1670366"
#> [10] "\t\tDiscarded sites:\t1264068"
#> [11] ""
#> [12] "\t\tProcessing:\t\t155.53 seconds"
#> [13] ""
#> [14] "\t\tPosition:\t\t8.936200e+07"
#> [15] "\t\tLikelihood:\t\t4.105582e+02"
#> [16] "\t\tAlpha:\t\t\t6.616326e-06"
#> [17] ""
#> [18] ""
#> [19] " Alignment 2"
#> [20] ""
#> [21] "\t\tChromosome:\t\tscaffold_1"
#> [22] "\t\tSequences:\t\t14"
#> [23] "\t\tSites:\t\t\t1447008"
#> [24] "\t\tDiscarded sites:\t1093595"
#> [25] ""
#> [26] "\t\tProcessing:\t\t138.83 seconds"
#> [27] ""
#> [28] "\t\tPosition:\t\t8.722482e+07"
#> [29] "\t\tLikelihood:\t\t2.531514e+02"
#> [30] "\t\tAlpha:\t\t\t1.031963e-05"
#> [31] ""
The Alignment numbers must be matched with the numbers before slashes in the other file,
report.txt.gz <- system.file(
"extdata", "SweeD_Report.txt.gz", package="nc")
report.vec <- readLines(report.txt.gz)
cat(report.vec[1:10], sep="\n")
#>
#> //1
#> Position Likelihood Alpha
#> 700.0000 4.637328e-03 2.763840e+02
#> 130585.6172 3.781283e-01 8.490200e-04
#> 260471.2344 3.602315e-02 4.691340e-03
#> 390356.8516 7.618749e-01 5.377668e-04
#> 520242.4688 2.979971e-08 1.411765e-01
#> 650128.0859 3.552965e-03 7.790821e-03
#> 780013.7031 4.637359e-03 1.727400e-02
cat(report.vec[1000:1010], sep="\n")
#> 129366774.7188 1.218965e-01 2.215489e-02
#> 129496660.3359 1.165627e-02 3.384931e-02
#> 129626545.9531 2.233934e-02 3.602669e-02
#> 129756434.0000 3.850623e-01 4.812648e+01
#>
#> //2
#> Position Likelihood Alpha
#> 135.0000 7.282316e-01 3.163686e+01
#> 111533.0625 2.548831e+00 4.932014e-04
#> 222931.1250 1.369720e-02 1.044774e+00
#> 334329.1875 7.118828e+00 3.965791e-04
The goal is to produce a bed file, which has tab-separated values with four columns: chrom, chromStart, chromEnd, Likelihood. The chrom values appear in the info file (Chromosome) so we will need to join the two files based on alignment ID. First we capture all alignments in the info file:
(info.dt <- nc::capture_all_str(
info.vec,
"Alignment ",
alignment="[0-9]+",
"\n\n\t\tChromosome:\t\t",
chrom=".*",
"\n"))
#> alignment chrom
#> <char> <char>
#> 1: 1 scaffold_0
#> 2: 2 scaffold_1
#> 3: 3 scaffold_2
#> 4: 4 scaffold_3
#> 5: 5 scaffold_4
#> 6: 6 scaffold_5
#> 7: 7 scaffold_6
#> 8: 8 scaffold_7
#> 9: 9 scaffold_8
#> 10: 10 scaffold_9
Then we capture all alignment/csv blocks in the report file:
(report.dt <- nc::capture_all_str(
report.vec,
"//",
alignment="[0-9]+",
"\n",
csv="[^/]+"
)[, {
data.table::fread(text=csv)
}, by=alignment])
#> alignment Position Likelihood Alpha
#> <char> <num> <num> <num>
#> 1: 1 700.0 4.637328e-03 2.763840e+02
#> 2: 1 130585.6 3.781283e-01 8.490200e-04
#> 3: 1 260471.2 3.602315e-02 4.691340e-03
#> 4: 1 390356.9 7.618749e-01 5.377668e-04
#> 5: 1 520242.5 2.979971e-08 1.411765e-01
#> ---
#> 9996: 10 82991564.8 8.051006e-03 1.357819e-03
#> 9997: 10 83074967.8 7.048433e-03 1.825764e-03
#> 9998: 10 83158370.8 1.012360e-07 7.999999e-03
#> 9999: 10 83241773.8 3.977189e-08 9.999997e-01
#> 10000: 10 83325174.0 3.980538e-08 1.200000e+03
Note that because by=alignment
was specified,
fread
is called for each unique value of
alignment
(i.e. each row). The results are combined into a
single data.table with all of the csv data from the original file, plus
the additional alignment
column. Next, we join this table
to the previous table in order to get the chrom
column:
(join.dt <- report.dt[info.dt, on=.(alignment)])
#> alignment Position Likelihood Alpha chrom
#> <char> <num> <num> <num> <char>
#> 1: 1 700.0 4.637328e-03 2.763840e+02 scaffold_0
#> 2: 1 130585.6 3.781283e-01 8.490200e-04 scaffold_0
#> 3: 1 260471.2 3.602315e-02 4.691340e-03 scaffold_0
#> 4: 1 390356.9 7.618749e-01 5.377668e-04 scaffold_0
#> 5: 1 520242.5 2.979971e-08 1.411765e-01 scaffold_0
#> ---
#> 9996: 10 82991564.8 8.051006e-03 1.357819e-03 scaffold_9
#> 9997: 10 83074967.8 7.048433e-03 1.825764e-03 scaffold_9
#> 9998: 10 83158370.8 1.012360e-07 7.999999e-03 scaffold_9
#> 9999: 10 83241773.8 3.977189e-08 9.999997e-01 scaffold_9
#> 10000: 10 83325174.0 3.980538e-08 1.200000e+03 scaffold_9
Finally the desired bed table can be created via
join.dt[, .(
chrom,
chromStart=as.integer(Position-1),
chromEnd=as.integer(Position),
Likelihood)]
#> chrom chromStart chromEnd Likelihood
#> <char> <int> <int> <num>
#> 1: scaffold_0 699 700 4.637328e-03
#> 2: scaffold_0 130584 130585 3.781283e-01
#> 3: scaffold_0 260470 260471 3.602315e-02
#> 4: scaffold_0 390355 390356 7.618749e-01
#> 5: scaffold_0 520241 520242 2.979971e-08
#> ---
#> 9996: scaffold_9 82991563 82991564 8.051006e-03
#> 9997: scaffold_9 83074966 83074967 7.048433e-03
#> 9998: scaffold_9 83158369 83158370 1.012360e-07
#> 9999: scaffold_9 83241772 83241773 3.977189e-08
#> 10000: scaffold_9 83325173 83325174 3.980538e-08
Exercise for the reader (easy): notice that the code above for
creating info.dt
involves repetition in the pattern and
group names (alignment
, Alignment
,
chrom
, Chromosome
). Re-write the pattern using
nc::field
in order to eliminate that repetition.
Exercise for the reader (hard): notice that Chromosome is only the
first field – how could you extract the other fields as well? Hint: use
nc::field
in a helper function in order to avoid
repetition.