Last updated 10 February 2024
daiR has many different functionalities, but the core one is to provide access to the Google Document AI API so you can OCR your documents. That procedure is fairly straightforward: you make a processing call with either dai_sync() or dai_async() – depending on whether you want synchronous or asynchronous processing – and then you retrieve the plaintext with get_text().
The quickest and easiest way to OCR with Document AI (DAI) is through synchronous processing. You simply pass an image file or a pdf (of up to 5 pages) to the processor and get the result into your R environment within seconds.
We can try with a sample pdf from the CIA’s Freedom of Information Act Electronic Reading Room:
library(daiR)
setwd(tempdir())
url <- "https://www.cia.gov/readingroom/docs/AGH%2C%20LASLO_0011.pdf"
download.file(url, "CIA.pdf")
We send it to Document AI with dai_sync() and store the HTTP response in an object. We then pass the response object to get_text(), which extracts the text identified by Document AI.
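In code, those two steps might look like this (a minimal sketch, assuming your Document AI processor is already configured as described in the setup vignette):

## NOT RUN
resp <- dai_sync("CIA.pdf")
text <- get_text(resp)
cat(text)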
What if we have many documents? dai_sync() is not vectorized, but you can iterate with it over vectors of filepaths. For the sake of illustration, let's download a second PDF.
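For instance (the URL below is just a placeholder; substitute any small PDF of your choosing):

url2 <- "https://www.example.com/sample.pdf"  # placeholder URL, not a real document
download.file(url2, "CIA2.pdf")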
We now want to apply the functions dai_sync() and get_text() iteratively over the files CIA.pdf and CIA2.pdf. In such cases you probably want to preserve the extracted text in .txt files along the way. You can do this by setting the parameter save_to_file in get_text() to TRUE. This function also has a parameter outfile_stem, which allows you to specify the name stem of the .txt file. We can get the stem from each file by combining fs::path_ext_remove() and basename().
I further recommend adding a small pause so as not to run into rate limit issues. A print statement is also useful for keeping track of where you are in case of an error or interruption. A sample script might thus look like this:
## NOT RUN
myfiles <- list.files(pattern = "\\.pdf$")
for (i in seq_along(myfiles)) {
  # Keep track of progress
  print(paste("Processing file", i, "of", length(myfiles)))
  resp <- dai_sync(myfiles[i])
  # Use the source filename (minus extension) as the .txt name stem
  stem <- fs::path_ext_remove(basename(myfiles[i]))
  get_text(resp, save_to_file = TRUE, outfile_stem = stem)
  # Brief pause to stay clear of rate limits
  Sys.sleep(2)
}
If you now run list.files(), you will see that the code generated two files named CIA.txt and CIA2.txt respectively.
Synchronous processing is very convenient, but has two limitations. One is that OCR accuracy may be slightly reduced compared with asynchronous processing, because dai_sync() converts the source file to a lightweight, grayscale image before passing it to DAI. The other is scaling: if you have a large pdf or many files, it is usually preferable to process them asynchronously.
In asynchronous (offline) processing, you don't send DAI the actual document, but rather its location on Google Storage so that DAI can process it "in its own time". While slightly slower than synchronous OCR, it allows for batch processing and makes the process less vulnerable to interruptions (like laptop battery death or inadvertent closing of your console). Unlike dai_sync(), dai_async() is vectorized, so you can send multiple files with a single call.
The first step is to use the package googleCloudStorageR to upload the source file(s) to a Google Storage bucket where DAI can find them. The following assumes that you have already configured Google Storage and set up a default bucket as described in this vignette.

Let's upload our two CIA documents. I am assuming the filepaths are still stored in the vector myfiles we created earlier.
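A sketch of the upload step, assuming a default bucket has already been set with googleCloudStorageR (e.g. via gcs_global_bucket()):

## NOT RUN
library(googleCloudStorageR)
for (file in myfiles) {
  # Store each file in the bucket under its basename
  gcs_upload(file, name = basename(file))
}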
Let’s check that our files made it safely:
We can now use dai_async() to tell Document AI to process these files. At this stage it is crucial to know that dai_async() takes as its main argument the filenames in the bucket, NOT the filenames or filepaths on your local drive. In this particular example there is no difference, but that is not always the case. A common error scenario is when you use a vector of full local filepaths (e.g. files <- c("/path/to/file1.pdf", "/path/to/file2.pdf")) to upload the files, saving them in the bucket under their basenames (file1.pdf and file2.pdf). When you then try to pass the same vector to dai_async(), the processing fails because Document AI cannot find /path/to/file1.pdf in the bucket, only file1.pdf.
It is therefore good practice to use the output of gcs_list_objects() to create the vector that you pass to dai_async(). From the vignette on Google Storage, we remember that if we store the output of gcs_list_objects() in a dataframe named contents, the filenames will be in contents$name.
If your call returned “status: 200”, it was accepted by the API. Note that this does NOT mean that the processing itself was successful, only that the request went through. For example, if there are errors in your filepaths, DAI will create empty JSON files in the folder you provided. If you see JSON files of around 70 bytes each in the destination folder, you know there was something wrong with your filenames. Other things too can cause the processing to fail, for example a corrupt file or a format that DAI cannot handle.
You can check the status of a job with dai_status(). Just pass the response object from your dai_async() call into the parentheses, like so:
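## NOT RUN
dai_status(resp)  # resp being the object holding the dai_async() response above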
This will tell you whether the job is "RUNNING", "FAILED", or "SUCCEEDED". It won't tell you how much time remains, but in my experience, processing takes about 5-20 seconds per page. To find out when it's done, you can either rerun dai_status() till it says "SUCCEEDED", or you can use the function dai_notify(), which will check the status for you in the background and beep when the job is finished.
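For example (a sketch, again assuming the dai_async() response is stored in resp):

## NOT RUN
dai_notify(resp)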
When the processing is done, there will be JSON output files waiting for you in the bucket. Let’s take a look.
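## NOT RUN
contents <- gcs_list_objects()  # refresh the bucket listing
grep("\\.json$", contents$name, value = TRUE)  # the new output files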
Output file names look cryptic, but there's a logic to them, namely: "<job_number>/<document_number>/<filename>-<shard_number>.json". Our file will thus take the form "<job_number>/0/CIA-0.json", with <job_number> changing from one processing call to the next.
These JSON files contain the extracted text plus a wealth of other data, such as the location of each word on the page and a binary version of the original image. In order to get to this information we need to download them to our local drive. Because these output files have unpredictable names, it is often easiest to simply search for all files ending in *.json using grep() or stringr::str_detect().
We can then download them with gcs_get_object().
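A sketch of those two steps, assuming the bucket listing is still stored in contents and the default bucket is set:

## NOT RUN
jsons <- grep("\\.json$", contents$name, value = TRUE)
for (json in jsons) {
  # Save each output file locally under its basename
  gcs_get_object(json, saveToDisk = basename(json), overwrite = TRUE)
}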
If you now run list.files() again, you should see CIA-0.json and CIA2-0.json in your working directory.
To get the text from a DAI JSON file, we can use get_text(), but we have to specify type = "async" so that the function knows it is being served a JSON file and not a response object.
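For example (a sketch, using one of the files we just downloaded):

## NOT RUN
text <- get_text("CIA-0.json", type = "async")
cat(text)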
To get the text from several JSON files, we just iterate over them, setting save_to_file to TRUE. Unlike in the dai_sync() example earlier, we don't need to specify outfile_stem, because get_text() has the names of the JSON files and uses their stems to create the .txt files.
library(purrr)
local_jsons <- list.files(pattern = "\\.json$")
map(local_jsons, ~ get_text(.x, type = "async", save_to_file = TRUE))
Running list.files() one last time, you should have two new files named CIA-0.txt and CIA2-0.txt.
Although dai_async() takes batches of files, it is constrained by Google's rate limits. Currently, a dai_async() call can contain a maximum of 50 files (a multi-page pdf counts as one file), and you cannot have more than 5 batch requests and 10,000 pages undergoing processing at any one time.
Therefore, if you're looking to process a large batch, you need to spread the dai_async() calls out over time. While you can split up your corpus into sets of 50 files and batch process those, the simplest solution is to make a function that sends files off individually with a small wait in between. Say we have a vector called big_batch containing thousands of filenames. First we would make a function like this:
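## NOT RUN
# Illustrative sketch; the name process_slowly and the 10-second pause are just examples
process_slowly <- function(file) {
  dai_async(file)
  Sys.sleep(10)
}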
Then we would iterate it over our file vector:
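## NOT RUN
# Sketch, assuming big_batch holds the bucket filenames of the documents to process
library(purrr)
map(big_batch, process_slowly)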
This will hold up your console for a while, so it may be worth doing in the background as an RStudio job.
Finding the optimal wait time for Sys.sleep() may require some trial and error. As a rule of thumb, it should approximate the time it takes for DAI to process one of your files. This, in turn, depends on the size of the files, since a 100-page pdf will take a lot longer to process than a single-page one. In my experience, a 10-second interval is ample time for a batch of single-page PDFs. Multi-page pdfs require proportionally more time. If your files vary in size, calibrate the wait time to the largest file, or you may get 429 errors (the HTTP code for "Too Many Requests") halfway through the iteration.
Although this procedure is relatively slow, it need not add much to the overall processing time. DAI starts processing the first files it receives right away, so when your loop ends, DAI will be mostly done with the OCR as well.
If you have long PDFs, DAI will break the output into shards, meaning that, for a single PDF file, you may get back multiple JSON files named *-0.json, *-1.json, and so on.
To weave the text back together again, you can use daiR's merge_shards() function. It works on .txt files, not JSON files, so you need to extract the text from the JSONs first. You also need to keep the name stem – turning document-1.json into document-1.txt and so forth – so that merge_shards() knows which pieces belong together. This is the default behaviour of get_text(), so as long as you don't touch the outfile_stem parameter, you should be fine.
Here is a sample workflow:
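## NOT RUN
# Sketch only: assumes the sharded JSON files are already downloaded to the
# working directory; see ?merge_shards for the exact arguments.
library(purrr)
local_jsons <- list.files(pattern = "\\.json$")
# Extract the text, keeping the shard name stems for the .txt files
map(local_jsons, ~ get_text(.x, type = "async", save_to_file = TRUE))
# Merge the .txt shards back into one text file per document
merge_shards(getwd(), getwd())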