Prior to streaming, make sure to install and load rtweet. This vignette assumes users have already set up app access tokens (see the "auth" vignette, vignette("auth", package = "rtweet")).
rtweet makes it possible to capture live streams of Twitter data¹.
There are two ways of opening a stream:

- A stream collecting data from a set of rules, which can be collected via filtered_stream().
- A stream of 1% of all tweets published, which can be collected via sample_stream().
In either case we need to choose how long the streaming connection should stay open and the file it should be saved to.
## Stream time in seconds: for one minute set timeout = 60
## For larger chunks of time, I recommend multiplying 60 by the number
## of desired minutes. This method scales up to hours as well
## (x * 60 = x mins, x * 60 * 60 = x hours)
## Stream for 5 seconds
streamtime <- 5
## Filename to save json data (backup)
filename <- "rstats.json"
The filtered stream collects tweets for all rules that are currently active, not just one rule or query.
Streaming rules in rtweet need a value and a tag. The value is the query to be performed, and the tag is the name used to identify tweets that match that query. Values can contain multiple words and hashtags; please read the official documentation for the full query syntax. Multiple rules can match a single tweet.
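As a minimal sketch (the #rstats value and tag below are only illustrative), a rule can be added with stream_add_rule():

## Add a rule matching tweets that use the #rstats hashtag,
## tagged "rstats" so matching tweets can be identified later
stream_add_rule(list(value = "#rstats", tag = "rstats"))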
To check the current rules, call stream_add_rule() with NULL; this tells you whether any rule is currently active:
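A minimal example, assuming a default token is already configured:

## Passing NULL lists the active rules instead of adding a new one
stream_add_rule(NULL)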
With the help of rules(), the id, value, and tag of each rule is provided.
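For example, passing the listing call through the helper (a sketch, assuming rules() accepts the object returned by stream_add_rule(NULL)):

## Show the id, value, and tag of each active rule
rules(stream_add_rule(NULL))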
To remove rules, use stream_rm_rule() with the ids of the rules to drop.
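A sketch of removing a single rule (the id below is hypothetical; use a real id obtained from rules()):

## Remove a rule by its id
stream_rm_rule("1234567890123456789")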
Note that if the rules are not used for some time, Twitter warns you that they will be removed. Given that filtered_stream() collects tweets for all active rules, it is advisable to keep the list of rules short and clean.
Once these parameters are specified, initiate the stream. Note: barring any disconnection or disruption of the API, streaming will occupy your current instance of R until the specified time has elapsed. It is possible to start a new instance of R (streaming itself usually isn't very memory intensive), but operations may drag a bit during the parsing process, which takes place immediately after streaming ends.
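A sketch of starting the filtered stream with the parameters set above (the rstats variable name is only illustrative):

## Stream tweets matching the active rules for `streamtime` seconds,
## saving the raw json to `filename` as a backup
rstats <- filtered_stream(timeout = streamtime, file = filename)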
If no tweet matching the rules is detected, a warning is issued.
Parsing larger streams can take quite a bit of time (in addition to time spent streaming) due to a somewhat time-consuming simplifying process used to convert a json file into an R object.
Don’t forget to clean the streaming rules:
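One way to do so, assuming rules() returns the rule ids as shown above, is to remove every active rule at once:

## Fetch the ids of all active rules and remove them
ids <- rules(stream_add_rule(NULL))$id
stream_rm_rule(ids)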
The sample_stream() function does not require any rules: it simply collects 1% of the tweets published at that moment.
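Its usage mirrors filtered_stream(); a minimal sketch, with "sample.json" as an illustrative file name:

## Collect a random 1% sample of tweets for `streamtime` seconds
random_sample <- sample_stream(timeout = streamtime, file = "sample.json")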
Users may want to stream tweets into json files upfront and parse
those files later on. To do this, simply add parse = FALSE
and make sure you provide a path (file name) to a location you can find
later.
You can also use append = TRUE
to continue recording a
stream into an already existing file.
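For example, a sketch combining both options:

## Save the raw json without parsing, appending to an existing file
filtered_stream(timeout = streamtime, file = filename,
                parse = FALSE, append = TRUE)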
Currently, parsing the streaming data file with parse_stream() is not functional. However, you can read the file back in with jsonlite::stream_in().
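A sketch of reading the saved file back in; stream_in() expects a connection, so the file name is wrapped in file():

## Read the line-delimited json back into R
tweets <- jsonlite::stream_in(file(filename))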
The parsed object should be the same whether a user parses up-front or from a json file in a later session.
Currently the returned object is a raw conversion of the feed into a nested list whose structure depends on the fields and expansions requested.
¹ Until November 2022 streaming was possible with API v1.1; this is no longer the case, and rtweet now uses API v2.