Title: Fast Tidy Tools for Date and Date-Time Manipulation
Description: A set of fast tidy functions for wrangling, completing and summarising date and date-time data. It combines 'tidyverse' syntax with the efficiency of 'data.table' and the speed of 'collapse'.
Authors: Nick Christofides [aut, cre]
Maintainer: Nick Christofides <[email protected]>
License: GPL (>= 2)
Version: 0.8.2
Built: 2025-01-16 10:24:40 UTC
Source: https://github.com/cranhaven/cranhaven.r-universe.dev
A framework for handling raw date and datetime data using tidy best practices from the tidyverse, the efficiency of data.table, and the speed of collapse. You can learn more about the tidyverse, data.table and collapse using the links below.

Maintainer: Nick Christofides <[email protected]> (ORCID)

Useful links:
Report bugs at https://github.com/NicChr/timeplyr/issues
Time units

.time_units
.period_units
.duration_units
.extra_time_units

.time_units: An object of class character of length 21.
.period_units: An object of class character of length 7.
.duration_units: An object of class character of length 11.
.extra_time_units: An object of class character of length 10.
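A quick sketch of how to inspect these exported unit vectors (lengths are as stated above):

library(timeplyr)
# Print the built-in unit vectors
.period_units
.duration_units
.extra_time_units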
Correct calculation of ages in years or months using lubridate periods. Leap-year calculations work as well.

age_years(start, end = if (is_date(start)) Sys.Date() else Sys.time())
age_months(start, end = if (is_date(start)) Sys.Date() else Sys.time())
start: Start date/datetime, typically date of birth.
end: End date/datetime. Default is the current date/datetime.

An integer vector of ages in years or months.
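A short usage sketch; the whole-year and whole-month spans chosen below make the expected values unambiguous:

library(timeplyr)
# Age in whole years between two fixed dates
age_years(as.Date("2000-01-01"), as.Date("2020-06-15")) # 20
# Age in whole months
age_months(as.Date("2000-01-01"), as.Date("2001-07-01")) # 18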
An alternative to dplyr::desc() which is much faster for character vectors and factors.

asc(x)
desc(x)
x: Vector.

A numeric vector that can be ordered in ascending or descending order. Useful in dplyr::arrange() or farrange().

library(dplyr)
library(timeplyr)
starwars %>%
  fdistinct(mass) %>%
  farrange(desc(mass))
Create a table of common time units from a date or datetime sequence.

calendar(
  x,
  label = TRUE,
  week_start = getOption("lubridate.week.start", 1),
  fiscal_start = getOption("lubridate.fiscal.start", 1),
  name = "time"
)
x: A date or datetime vector.
label: Logical. Should labelled (ordered factor) versions of week day and month be returned? Default is TRUE.
week_start: Day on which the week starts following ISO conventions: 1 means Monday (the default), 7 means Sunday.
fiscal_start: Numeric indicating the starting month of a fiscal year. Default is 1.
name: Name of the date/datetime column. Default is "time".

An object of class tibble.
library(timeplyr)
library(lubridate)
# Create a calendar for the current year
from <- floor_date(today(), unit = "year")
to <- ceiling_date(today(), unit = "year", change_on_boundary = TRUE) - days(1)
my_seq <- time_seq(from, to, time_by = "day")
calendar(my_seq)
A do.call() and data.table::CJ() method

This function operates like do.call(CJ, ...) and accepts a list or data frame as an argument. It has less overhead for small joins, especially when unique = FALSE and as_dt = FALSE. NAs are sorted last by default.

crossed_join(
  X,
  sort = FALSE,
  unique = TRUE,
  as_dt = TRUE,
  strings_as_factors = FALSE
)
X: A list or data frame.
sort: Should the expansion be sorted? Default is FALSE.
unique: Should unique values across each column or list element be taken? Default is TRUE.
as_dt: Should the result be a data.table? Default is TRUE.
strings_as_factors: Should strings be converted to factors before expansion? Default is FALSE.

An important note is that currently NAs are sorted last and therefore a key is not set.

A data.table or list object.
library(timeplyr)
crossed_join(list(1:3, -2:2))
crossed_join(iris)
Find duplicate rows
duplicate_rows(
  data,
  ...,
  .keep_all = FALSE,
  .both_ways = FALSE,
  .add_count = FALSE,
  .drop_empty = FALSE,
  sort = FALSE,
  .by = NULL,
  .cols = NULL
)
data: A data frame.
...: Variables used to find duplicate rows.
.keep_all: If TRUE, all columns are kept. Default is FALSE.
.both_ways: If TRUE, the first instance of each duplicate row is also retained. Default is FALSE.
.add_count: If TRUE, a count column is added. Default is FALSE.
.drop_empty: If TRUE, rows with all NA values are dropped. Default is FALSE.
sort: Should the result be sorted? Default is FALSE.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts a named character vector or numeric vector of columns.
This function works like dplyr::distinct() in its handling of arguments and data-masking but returns duplicate rows. In certain situations it can be much faster than data %>% group_by() %>% filter(n() > 1) when there are many groups. fduplicates2() returns the same output but uses a different method which utilises joins and is written almost entirely using dplyr.

A data.frame of duplicate rows.

See also: fcount, group_collapse, fdistinct.
library(dplyr)
library(timeplyr)
library(ggplot2)
# Duplicates across all columns
diamonds %>% duplicate_rows()
# Alternatively with row ids
diamonds %>% filter(frowid(.) > 1)
# Diamonds with the same dimensions
diamonds %>% duplicate_rows(x, y, z)
# Can use tidyverse select notation
diamonds %>% duplicate_rows(across(where(is.factor)), .keep_all = FALSE)
# Similar to janitor::get_dupes()
diamonds %>% duplicate_rows(.add_count = TRUE)
# Keep the first instance of each duplicate row
diamonds %>% duplicate_rows(.both_ways = TRUE)
# Same as the below
diamonds %>% fadd_count(across(everything())) %>% filter(n > 1)
Like dplyr::cume_dist(x) and ecdf(x)(x) but with added grouping and weighting functionality. You can calculate the empirical distribution of x using aggregated data by supplying frequency weights. No expansion occurs, which makes this function extremely efficient for this type of data, of which plotting is a common application.

edf(x, g = NULL, wt = NULL)
x: Numeric vector.
g: Numeric vector of group IDs.
wt: Frequency weights.

A numeric vector the same length as x.
library(timeplyr)
library(dplyr)
library(ggplot2)
set.seed(9123812)
x <- sample(seq(-10, 10, 0.5), size = 10^2, replace = TRUE)
plot(sort(edf(x)))
all.equal(edf(x), ecdf(x)(x))
all.equal(edf(x), cume_dist(x))
# Manual ECDF plot using only aggregate data
y <- rnorm(100, 10)
start <- floor(min(y) / 0.1) * 0.1
grid <- time_span(y, time_by = 0.1, from = start)
counts <- time_countv(y, time_by = 0.1, from = start, complete = TRUE)$n
edf <- edf(grid, wt = counts)
# Trivial here as this is the same
all.equal(unname(cumsum(counts)/sum(counts)), edf)
# Full ecdf
tibble(x) %>%
  ggplot(aes(x = y)) +
  stat_ecdf()
# Approximation using aggregate only data
tibble(grid, edf) %>%
  ggplot(aes(x = grid, y = edf)) +
  geom_step()
# Grouped example
g <- sample(letters[1:3], size = 10^2, replace = TRUE)
edf1 <- tibble(x, g) %>%
  mutate(edf = cume_dist(x), .by = g) %>%
  pull(edf)
edf2 <- edf(x, g = g)
all.equal(edf1, edf2)
A 'collapse' version of dplyr::arrange()

This is a fast and near-identical alternative to dplyr::arrange() using the collapse package. desc() is like dplyr::desc() but works faster when called directly on vectors.

farrange(data, ..., .by = NULL, .by_group = FALSE, .cols = NULL)
data: A data frame.
...: Variables to arrange by.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.by_group: If TRUE, the data are sorted by the groups first. Default is FALSE.
.cols: (Optional) alternative to ... that accepts a named character vector or numeric vector of columns.
farrange() is inspired by collapse::roworder() but also supports dplyr-style data-masking, which makes it a closer replacement to dplyr::arrange(). You can use desc() interchangeably with dplyr and timeplyr: arrange(iris, desc(Species)) uses dplyr's version, while farrange(iris, desc(Species)) uses timeplyr's version. farrange() is faster when there are many groups or a large number of rows.

A sorted data.frame.
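A minimal usage sketch with the built-in iris data:

library(timeplyr)
# Sort by Species (descending), then Sepal.Length (ascending)
farrange(iris, desc(Species), Sepal.Length)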
Near-identical alternative to dplyr::count().

fcount(
  data,
  ...,
  wt = NULL,
  sort = FALSE,
  order = df_group_by_order_default(data),
  name = NULL,
  .by = NULL,
  .cols = NULL
)
fadd_count(
  data,
  ...,
  wt = NULL,
  sort = FALSE,
  order = df_group_by_order_default(data),
  name = NULL,
  .by = NULL,
  .cols = NULL
)
data: A data frame.
...: Variables to group by.
wt: Frequency weights. Can be NULL or a variable.
sort: If TRUE, the largest counts are shown first. Default is FALSE.
order: Should the groups be calculated as ordered groups? If FALSE, groups appear in order of first appearance, which is faster.
name: The name of the new column in the output. If there's already a column called n, a different default name is used.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts a named character vector or numeric vector of columns.
This is a fast and near-identical alternative to dplyr::count() using the collapse package. Unlike collapse::fcount(), this works very similarly to dplyr::count(). The main difference is that anything supplied to wt is recycled and added as a data variable. Other than that, everything works exactly as the dplyr equivalent. fcount() and fadd_count() can be over 100x faster than the dplyr equivalents.

A data.frame of frequency counts by group.
library(timeplyr)
library(dplyr)
iris %>% fcount()
iris %>%
  fadd_count(name = "count") %>%
  fslice_head(n = 10)
iris %>%
  group_by(Species) %>%
  fcount()
iris %>% fcount(Species)
iris %>% fcount(across(where(is.numeric), mean))
### Sorting behaviour
# Sorted by group
starwars %>% fcount(hair_color)
# Sorted by frequency
starwars %>% fcount(hair_color, sort = TRUE)
# Groups sorted by order of first appearance (faster)
starwars %>% fcount(hair_color, order = FALSE)
Like dplyr::distinct() but faster when lots of groups are involved.

fdistinct(
  data,
  ...,
  .keep_all = FALSE,
  sort = FALSE,
  order = sort,
  .by = NULL,
  .cols = NULL
)
data: A data frame.
...: Variables used to find distinct rows.
.keep_all: If TRUE, all columns are kept. Default is FALSE.
sort: Should the result be sorted? Default is FALSE.
order: Should the groups be calculated as ordered groups? Defaults to the value of sort.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts a named character vector or numeric vector of columns.

A data.frame of distinct groups.
library(dplyr)
library(timeplyr)
library(ggplot2)
mpg %>% distinct(manufacturer)
mpg %>% fdistinct(manufacturer)
Fast versions of tidyr::expand() and tidyr::complete().

fexpand(
  data,
  ...,
  expand_type = c("crossing", "nesting"),
  sort = FALSE,
  .by = NULL
)
fcomplete(
  data,
  ...,
  expand_type = c("crossing", "nesting"),
  sort = FALSE,
  .by = NULL,
  fill = NA
)
data: A data frame.
...: Variables to expand.
expand_type: Type of expansion to use, where "nesting" finds combinations already present in the data (much like using distinct() on the supplied variables) and "crossing" finds all possible combinations.
sort: Logical. If TRUE, the result is sorted. Default is FALSE.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
fill: A named list containing value-name pairs to fill the named implicit missing values.
For un-grouped data, fexpand() is similar in speed to tidyr::expand(). When the data contain many groups, fexpand() is much faster (see examples). The two main differences between fexpand() and tidyr::expand() are that: (1) tidyr-style helpers like nesting() and crossing() are ignored, and (2) the type of expansion used is controlled through expand_type and applies to all supplied variables.

Expressions are first calculated on the entire ungrouped dataset before being expanded, but within-group expansions will work on variables that already exist in the dataset. For example, iris %>% group_by(Species) %>% fexpand(Sepal.Length, Sepal.Width) will perform a grouped expansion, but iris %>% group_by(Species) %>% fexpand(range(Sepal.Length)) will not.

For efficiency, when supplying groups, expansion is done on a by-group basis only if there are 2 or more variables that aren't part of the grouping. The reason is that a by-group calculation is not needed with 1 expansion variable, as all combinations across groups already exist against that 1 variable. When expand_type = "nesting", groups are ignored for speed purposes as the result is the same.

An advantage of fexpand() is that it returns a data frame with the same class as the input. It also uses data.table for memory efficiency and collapse for speed.

A future development for fcomplete() would be to fill only those values that correspond both to additional completed rows and to rows that match the expanded rows. For example, iris %>% mutate(test = NA_real_) %>% complete(Sepal.Length = 0:100, fill = list(test = 0)) fills in all NA values of test, whereas iris %>% mutate(test = NA_real_) %>% fcomplete(Sepal.Length = 0:100, fill = list(test = 0)) should only fill in values of test that correspond to Sepal.Length values of 0:100.

An additional note when expand_type = "nesting": if one of the supplied variables in ... does not exist in the data, but can be recycled to the length of the data, then it is added and treated as a data variable.

A data.frame of expanded groups.
library(timeplyr)
library(dplyr)
library(lubridate)
library(nycflights13)
flights %>% fexpand(origin, dest)
flights %>% fexpand(origin, dest, sort = FALSE)
# Grouped expansions example
# 1 extra group (carrier) so this is very quick
flights %>%
  group_by(origin, dest, tailnum) %>%
  fexpand(carrier)
Alternative to dplyr::group_by()

This works exactly like dplyr::group_by() and typically performs at around the same speed but uses slightly less memory.

fgroup_by(
  data,
  ...,
  .add = FALSE,
  order = df_group_by_order_default(data),
  .by = NULL,
  .cols = NULL,
  .drop = df_group_by_drop_default(data)
)
data: A data frame.
...: Variables to group by.
.add: Should groups be added to existing groups? Default is FALSE.
order: Should groups be ordered? If FALSE, groups appear in order of first appearance. Default is df_group_by_order_default(data).
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts a named character vector or numeric vector of columns.
.drop: Should unused factor levels be dropped? Default is df_group_by_drop_default(data).
fgroup_by() works almost exactly like the 'dplyr' equivalent. An attribute "sorted" (TRUE or FALSE) is added to the group data to signify whether the groups are sorted or not.

A grouped_df.
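A minimal sketch, pairing fgroup_by() with fcount() from earlier in this reference:

library(timeplyr)
library(dplyr)
iris %>%
  fgroup_by(Species) %>%
  fcount()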
Very fast row numbers by group.
frowid(x, ascending = TRUE)

x: A vector, data frame or GRP object.
ascending: When TRUE (the default), row IDs are in increasing order; when FALSE, they are in decreasing order.
frowid() is like data.table::rowid() but uses an alternative method for calculating row numbers. When x is a collapse GRP object, it is considerably faster. It is also faster for character vectors.

An integer vector of row IDs.
library(timeplyr)
library(dplyr)
library(data.table)
library(nycflights13)
# Simple row numbers
head(row_id(flights))
# Row numbers by origin
head(frowid(flights$origin))
head(row_id(flights, origin))
# Fast duplicate rows
head(frowid(flights) > 1)
# With data frames, better to use row_id()
flights %>%
  add_row_id() %>% # Plain row ids
  add_row_id(origin, dest, .name = "grouped_row_id") # Row IDs by group
Fast dplyr::select()/dplyr::rename()

fselect() operates the exact same way as dplyr::select() and can be used naturally with tidy-select helpers. It uses collapse to perform the actual selecting of variables and is considerably faster than dplyr for selecting exact columns, and even more so when supplying the .cols argument.

fselect(data, ..., .cols = NULL)
frename(data, ..., .cols = NULL)
data: A data frame.
...: Variables to select using tidy-select.
.cols: (Optional) faster alternative to ... that accepts a named character vector or numeric vector of columns.

A data.frame of selected columns.
library(timeplyr)
library(dplyr)
df <- slice_head(iris, n = 5)
fselect(df, Species, SL = Sepal.Length)
fselect(df, .cols = c("Species", "Sepal.Length"))
fselect(df, all_of(c("Species", "Sepal.Length")))
fselect(df, 5, 1)
fselect(df, .cols = c(5, 1))
df %>% fselect(where(is.numeric))
Faster dplyr::slice()

When there are lots of groups, the fslice() functions are much faster.

fslice(data, ..., .by = NULL, keep_order = FALSE, sort_groups = TRUE)
fslice_head(
  data,
  ...,
  n,
  prop,
  .by = NULL,
  keep_order = FALSE,
  sort_groups = TRUE
)
fslice_tail(
  data,
  ...,
  n,
  prop,
  .by = NULL,
  keep_order = FALSE,
  sort_groups = TRUE
)
fslice_min(
  data,
  order_by,
  ...,
  n,
  prop,
  .by = NULL,
  with_ties = TRUE,
  na_rm = FALSE,
  keep_order = FALSE,
  sort_groups = TRUE
)
fslice_max(
  data,
  order_by,
  ...,
  n,
  prop,
  .by = NULL,
  with_ties = TRUE,
  na_rm = FALSE,
  keep_order = FALSE,
  sort_groups = TRUE
)
fslice_sample(
  data,
  n,
  replace = FALSE,
  prop,
  .by = NULL,
  keep_order = FALSE,
  sort_groups = TRUE,
  weights = NULL,
  seed = NULL
)
data: A data frame.
...: Row indices; see ?dplyr::slice.
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
keep_order: Should the sliced data frame be returned in its original order? Default is FALSE.
sort_groups: If TRUE (the default), the by-group slicing is done in order of the sorted groups.
n: Number of rows.
prop: Proportion of rows.
order_by: Variables to order by.
with_ties: Should ties be kept together? Default is TRUE.
na_rm: Should missing values in order_by be removed? Default is FALSE.
replace: Should fslice_sample() sample with replacement? Default is FALSE.
weights: Probability weights used in fslice_sample().
seed: Seed number defining the RNG state. If supplied, this is only applied locally within the function and the seed state isn't retained after sampling. To clarify, whatever seed state was in place before the function call is restored, to ensure seed continuity. If left NULL (the default), the seed is never modified.
fslice() and friends allow for more flexibility in how you order the by-group slicing. Furthermore, you can control whether the returned data frame is sliced in the order of the supplied row indices, or whether the original order is retained (like dplyr::filter()).

In fslice(), when length(n) == 1, an optimised method is implemented that internally uses list_subset(), a fast function for extracting single elements from single-level lists that contain vectors of the same type, e.g. integer.

fslice_head() and fslice_tail() are very fast with large numbers of groups. fslice_sample() is arguably more intuitive as it by default resamples each entire group without replacement, without having to specify a maximum group size like in dplyr::slice_sample().

A data.frame of specified rows.
library(timeplyr)
library(dplyr)
library(nycflights13)
flights <- flights %>% group_by(origin, dest)
# First row repeated for each group
flights %>% fslice(1, 1)
# First row per group
flights %>% fslice_head(n = 1)
# Last row per group
flights %>% fslice_tail(n = 1)
# Earliest flight per group
flights %>% fslice_min(time_hour, with_ties = FALSE)
# Last flight per group
flights %>% fslice_max(time_hour, with_ties = FALSE)
# Random sample without replacement by group
# (or stratified random sampling)
flights %>% fslice_sample()
The output is a list containing summary statistics of the time delay between two date/datetime vectors. This can be especially useful in estimating reporting delay, for example.

data - A data frame containing the origin, end and calculated time delay.
unit - The chosen time unit.
num - The number of time units.
summary - A tibble with summary statistics.
delay - A tibble containing the empirical cumulative distribution function values by time delay.
plot - A ggplot of the time delay distribution.

get_time_delay(
  data,
  origin,
  end,
  time_by = 1L,
  time_type = getOption("timeplyr.time_type", "auto"),
  min_delay = -Inf,
  max_delay = Inf,
  probs = c(0.25, 0.5, 0.75, 0.95),
  .by = NULL,
  include_plot = TRUE,
  x_scales = "fixed",
  bw = "sj",
  ...
)
data: A data frame.
origin: Origin date variable.
end: End date variable.
time_by: Time unit. Must be one of the three: a unit string (e.g. "days" or "2 weeks"), a named list of length one (e.g. list("days" = 7)), or a numeric vector (e.g. 7).
time_type: If "auto", the time type (periods or durations) is chosen automatically.
min_delay: The minimum acceptable delay; all delays less than this are removed before calculation. Default is -Inf.
max_delay: The maximum acceptable delay; all delays greater than this are removed before calculation. Default is Inf.
probs: Probabilities used in the quantile summary. Default is c(0.25, 0.5, 0.75, 0.95).
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
include_plot: Should a plot of the time delay distribution be included? Default is TRUE.
x_scales: Option to control how the x-axis is displayed for multiple facets. Choices are "fixed" or "free_x".
bw: The smoothing bandwidth selector for the kernel density estimator. If numeric, the standard deviation of the smoothing kernel. If character, a rule to choose the bandwidth; see ?stats::density.
...: Further arguments to be passed on to the underlying plotting and density routines.

A list containing summary data, summary statistics and an optional ggplot.
library(timeplyr)
library(outbreaks)
library(dplyr)
ebola_linelist <- ebola_sim_clean$linelist
# Incubation period distribution
# 95% of individuals experienced an incubation period of <= 26 days
inc_distr_days <- ebola_linelist %>%
  get_time_delay(date_of_infection, date_of_onset, time_by = "days")
head(inc_distr_days$data)
inc_distr_days$unit
inc_distr_days$num
inc_distr_days$summary
head(inc_distr_days$delay) # ECDF and freq by delay
inc_distr_days$plot
# Can change bandwidth selector
inc_distr_days <- ebola_linelist %>%
  get_time_delay(date_of_infection, date_of_onset,
                 time_by = "day", bw = "nrd")
inc_distr_days$plot
# Can choose any time units
inc_distr_weeks <- ebola_linelist %>%
  get_time_delay(date_of_infection, date_of_onset,
                 time_by = "weeks", bw = "nrd")
inc_distr_weeks$plot
Key group information
group_collapse(
  data,
  ...,
  order = TRUE,
  sort = FALSE,
  ascending = TRUE,
  .by = NULL,
  .cols = NULL,
  id = TRUE,
  size = TRUE,
  loc = TRUE,
  start = TRUE,
  end = TRUE,
  .drop = df_group_by_drop_default(data)
)
data: A data frame or vector.
...: Additional groups using tidy data-masking rules.
order: Should the groups be ordered? The physical order of the data is not changed. Default is TRUE.
sort: Should the data frame be sorted by the groups? Default is FALSE.
ascending: Should groups be ordered in ascending order? Default is TRUE.
.by: Alternative way of supplying groups using tidy-select notation.
.cols: (Optional) alternative to ... that accepts a named character vector or numeric vector of columns.
id: Should group IDs be added? Default is TRUE.
size: Should group sizes be added? Default is TRUE.
loc: Should group locations be added? Default is TRUE.
start: Should group start locations be added? Default is TRUE.
end: Should group end locations be added? Default is TRUE.
.drop: Should unused factor levels be dropped? Default is df_group_by_drop_default(data).
group_collapse() is similar to dplyr::group_data() but differs in 3 key regards:

1. The output tries to convey as much information about the groups as possible. By default, like dplyr, the groups are ordered, but unlike dplyr they are not sorted, which conveys information on order-of-first-appearance in the data. In addition to group locations, group sizes and start indices are returned.
2. There is more flexibility in specifying how the groups are ordered and/or sorted.
3. collapse is used to obtain the grouping structure, which is very fast.

There are 3 ways to specify the groups:

1. Using ..., which utilises tidy data-masking.
2. Using .by, which utilises tidyselect.
3. Using .cols, which accepts a named character/integer vector.

A tibble of unique groups and an integer ID uniquely identifying each group.
library(timeplyr)
library(dplyr)
iris <- dplyr::as_tibble(iris)
group_collapse(iris) # No groups
group_collapse(iris, Species) # Species groups
iris %>%
  group_by(Species) %>%
  group_collapse() # Same thing
# Group entire data frame
group_collapse(iris, .by = everything())
These are tidy-based functions for calculating group IDs, row IDs and group orders. group_id() returns an integer vector of group IDs the same size as the data. row_id() returns an integer vector of row IDs. group_order() returns the order of the groups. The add_ variants add a column of group IDs/row IDs/group orders.

group_id(
  data,
  ...,
  order = TRUE,
  ascending = TRUE,
  .by = NULL,
  .cols = NULL,
  as_qg = FALSE
)
add_group_id(
  data,
  ...,
  order = TRUE,
  ascending = TRUE,
  .by = NULL,
  .cols = NULL,
  .name = NULL,
  as_qg = FALSE
)
row_id(data, ..., ascending = TRUE, .by = NULL, .cols = NULL)
## S3 method for class 'GRP'
row_id(data, ascending = TRUE, ...)
add_row_id(data, ..., ascending = TRUE, .by = NULL, .cols = NULL, .name = NULL)
group_order(data, ..., ascending = TRUE, .by = NULL, .cols = NULL)
add_group_order(
  data,
  ...,
  ascending = TRUE,
  .by = NULL,
  .cols = NULL,
  .name = NULL
)
data: A data frame or vector.
...: Additional groups using tidy data-masking rules.
order: Should the groups be ordered? The physical order of the data is not changed. The identities identical(order(x, na.last = TRUE), order(group_id(x, order = TRUE))) and, in the case of a data frame, identical(order(x1, x2, x3, na.last = TRUE), order(group_id(data, x1, x2, x3, order = TRUE))) should always hold.
ascending: Should the group order be ascending or descending? Default is TRUE (ascending).
.by: Alternative way of supplying groups using tidy-select notation.
.cols: (Optional) alternative to ... that accepts a named character vector or numeric vector of columns.
as_qg: Should the group IDs be returned as a collapse "qG" class? The default (FALSE) returns an integer vector.
.name: Name of the added ID column, which should be a character vector of length 1. If NULL, a default name is used.
It's important to note that for data frames, these functions by default assume no groups unless you supply them. This means that when no groups are supplied:

group_id(iris) returns a vector of ones
row_id(iris) returns the plain row id numbers
group_order(iris) == row_id(iris)

One can specify groups in the second argument like so:

group_id(iris, Species)
row_id(iris, across(all_of("Species")))
group_order(iris, across(where(is.numeric), desc))

If you want group_id to always use all the columns of a data frame for grouping, while simultaneously utilising the group_id methods, one can use the below function.

group_id2 <- function(data, ...){
  group_id(data, ..., .cols = names(data))
}

An integer vector.
library(timeplyr)
library(dplyr)
library(ggplot2)
group_id(iris) # No groups
group_id(iris, Species) # Species groups
row_id(iris) # Plain row IDs
row_id(iris, Species) # Row IDs by group
# Order of Species + descending Petal.Width
group_order(iris, Species, desc(Petal.Width))
# Same as order(iris$Species, -xtfrm(iris$Petal.Width))
# Tidy data-masking/tidyselect can be used
group_id(iris, across(where(is.numeric))) # Groups across numeric values
# Alternatively using tidyselect
group_id(iris, .by = where(is.numeric))
# Group IDs using a mixed order
group_id(iris, desc(Species), Sepal.Length, desc(Petal.Width))
# add_ helpers
iris %>%
  distinct(Species) %>%
  add_group_id(Species)
iris %>%
  add_row_id(Species) %>%
  pull(row_id)
# Usage in data.table
library(data.table)
iris_dt <- as.data.table(iris)
iris_dt[, group_id := group_id(.SD, .cols = names(.SD)),
        .SDcols = "Species"]
# Or if you're using this often you can write a wrapper
set_add_group_id <- function(x, ..., .name = "group_id"){
  id <- group_id(x, ...)
  data.table::set(x, j = .name, value = id)
}
set_add_group_id(iris_dt, desc(Species))[]
mm_mpg <- mpg %>%
  select(manufacturer, model) %>%
  arrange(desc(pick(everything())))
# Sorted/non-sorted groups
mm_mpg %>%
  add_group_id(across(everything()),
               .name = "sorted_id", order = TRUE) %>%
  add_group_id(manufacturer, model,
               .name = "not_sorted_id", order = FALSE) %>%
  distinct()
Calculate basic growth calculations on a rolling basis. growth() calculates the percent change between the totals of two numeric vectors when they're of equal length, otherwise the percent change between the means. rolling_growth() does the same calculation on 1 numeric vector, on a rolling basis. Pairs of windows of length n, lagged by the value specified by lag, are compared in a similar manner. When lag = n then data.table::frollsum() is used, otherwise data.table::frollmean() is used.

growth(x, y, na.rm = FALSE, log = FALSE, inf_fill = NULL)
rolling_growth(
  x,
  n = 1,
  lag = n,
  na.rm = FALSE,
  partial = TRUE,
  offset = NULL,
  weights = NULL,
  inf_fill = NULL,
  log = FALSE,
  ...
)
x: Numeric vector.
y: Numeric vector.
na.rm: Should missing values be removed when calculating the window? Default is FALSE.
log: If TRUE, growth is calculated on the log scale.
inf_fill: Numeric value to replace Inf values with.
n: Rolling window size; default is 1.
lag: Lag of the basic growth comparison; default is the rolling window size.
partial: Should rates be calculated outwith the window using partial windows? Default is TRUE.
offset: Numeric vector of values to use as offset, e.g. population sizes or exposure times.
weights: Importance weights. These can either be length 1 or the same length as x. Currently, no normalisation of weights occurs.
...: Further arguments to be passed on to data.table::frollmean()/data.table::frollsum().

growth() returns a numeric(1) and rolling_growth() returns a numeric(length(x)).
library(timeplyr)
set.seed(42)
# Growth rate is 6% per day
x <- 10 * (1.06)^(0:25)
# Simple growth from one day to the next
rolling_growth(x, n = 1)
# Growth comparing rolling 3 day cumulative
rolling_growth(x, n = 3)
# Growth comparing rolling 3 day cumulative, lagged by 1 day
rolling_growth(x, n = 3, lag = 1)
# Growth comparing windows of equal size
rolling_growth(x, n = 3, partial = FALSE)
# Seven day moving average growth
roll_mean(rolling_growth(x), window = 7, partial = FALSE)
Calculate the rate of percentage change per unit time.
growth_rate(x, na.rm = FALSE, log = FALSE, inf_fill = NULL)

x: Numeric vector.
na.rm: Should missing values be removed from the calculation? Default is FALSE.
log: If TRUE, growth is calculated on the log scale.
inf_fill: Numeric value to replace Inf values with.
It is assumed that x is a vector of values with a corresponding time index that increases regularly with no gaps or missing values. The output is to be interpreted as the average percent change per unit time.

For a rolling version that can calculate rates as you move through time, see roll_growth_rate. For a more generalised method that incorporates time gaps and complex time windows, use time_roll_growth_rate.

The growth rate can also be calculated using the geometric mean of percent changes. The below identity should always hold:
`tail(roll_growth_rate(x, window = length(x)), 1) == growth_rate(x)`

A numeric(1).

See also: roll_growth_rate, time_roll_growth_rate.
library(timeplyr)
set.seed(42)
initial_investment <- 100
years <- 1990:2000
# Assume a rate of 8% increase with noise
relative_increases <- 1.08 + rnorm(10, sd = 0.005)
assets <- Reduce(`*`, relative_increases,
                 init = initial_investment, accumulate = TRUE)
assets
# Note that this is approximately 8%
growth_rate(assets)
# We can also calculate the growth rate via geometric mean
rel_diff <- exp(diff(log(assets)))
all.equal(rel_diff, relative_increases)
geometric_mean <- function(x, na.rm = TRUE, weights = NULL){
  exp(collapse::fmean(log(x), na.rm = na.rm, w = weights))
}
geometric_mean(rel_diff) == growth_rate(assets)
# Weighted growth rate
w <- c(rnorm(5)^2, rnorm(5)^4)
geometric_mean(rel_diff, weights = w)
# Rolling growth rate over the last n years
roll_growth_rate(assets)
# The same but using geometric means
exp(roll_mean(log(c(NA, rel_diff))))
# Rolling growth rate over the last 5 years
roll_growth_rate(assets, window = 5)
roll_growth_rate(assets, window = 5, partial = FALSE)
## Rolling growth rate with gaps in time
years2 <- c(1990, 1993, 1994, 1997, 1998, 2000)
assets2 <- assets[years %in% years2]
# Below does not incorporate time gaps into growth rate calculation
# But includes helpful warning
time_roll_growth_rate(assets2, window = 5, time = years2)
# Time step allows us to calculate correct rates across time gaps
time_roll_growth_rate(assets2, window = 5,
                      time = years2, time_step = 1) # Time aware
Time interval utilities
interval_start(x)
interval_end(x)
interval_count(x)
interval_range(x, na_rm = TRUE)
interval_length(x, ...)

x: A 'time_interval'.
na_rm: Should NA values be removed? Default is TRUE.
...: Additional arguments passed on to methods.
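A minimal sketch, assuming a time_interval(start, end) constructor is available in timeplyr (an assumption; check the package reference):

library(timeplyr)
# Assumed constructor: time_interval(start, end)
int <- time_interval(as.Date("2020-01-01"), as.Date("2020-01-31"))
interval_start(int) # start of each interval
interval_end(int)   # end of each interval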
Utility functions for checking if a variable is a date or datetime.

is_date(x)
is_datetime(x)
is_time(x)
is_time_or_num(x)

x: Time variable.
A logical of length 1.
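A brief sketch of the expected behaviour:

library(timeplyr)
is_date(Sys.Date())     # expected: TRUE
is_datetime(Sys.time()) # expected: TRUE
is_time_or_num(3.14)    # expected: TRUE (plain numerics count)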
Are all numbers whole numbers?
is_whole_number(x, tol = .Machine$double.eps, na.rm = TRUE)

x: A numeric vector.
tol: Tolerance value.
na.rm: Should NA values be ignored? Default is TRUE.
This is a very efficient function that returns FALSE if any number is not a whole number and TRUE if all of them are. x is defined as a whole-number vector if all numbers satisfy abs(x - round(x)) < tol.

NA handling: NA values are handled in a custom way. If x is an integer, TRUE is always returned even if x has missing values. If x has both missing values and decimal numbers, FALSE is always returned. If x has missing values and only whole numbers, and na.rm = FALSE, then NA is returned. Basically, NA is only returned if na.rm = FALSE and x is a double vector of only whole numbers and NA values.

Inspired by the discussion in this thread: check-if-the-number-is-integer.
A logical vector of length 1.
library(timeplyr)
library(dplyr)
# Has built-in tolerance
sqrt(2)^2 %% 1 == 0
is_whole_number(sqrt(2)^2)
is_whole_number(1)
is_whole_number(1.2)
x1 <- c(0.02, 0:10^5)
x2 <- c(0:10^5, 0.02)
is_whole_number(x1)
is_whole_number(x2)
# Somewhat more strict than all.equal
all.equal(10^9 + 0.0001, round(10^9 + 0.0001))
is_whole_number(10^9 + 0.0001)
# Can safely be used to select whole number variables
starwars %>% select(where(is_whole_number))
# To reduce the size of any data frame one can use the below code
df <- starwars %>%
  mutate(across(where(is_whole_number), as.integer))
iso_week() is a flexible function to return formatted ISO weeks, with optional ISO year and ISO day. isoday() returns the day of the ISO week.

iso_week(x, year = TRUE, day = FALSE)
isoday(x)
x: Date vector.
year: Logical. If TRUE (the default), the ISO year is included.
day: Logical. If TRUE, the ISO day is included. Default is FALSE.

An ISO week vector of class character.
library(timeplyr)
library(lubridate)
iso_week(today())
iso_week(today(), day = TRUE)
iso_week(today(), year = FALSE, day = TRUE)
iso_week(today(), year = FALSE, day = FALSE)
Check for missing dates between first and last date
missing_dates(x)
n_missing_dates(x)

x: A date or datetime vector, or a data frame.

A date vector if x is a vector, or a list if x is a data.frame.
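A minimal sketch of the expected behaviour:

library(timeplyr)
x <- as.Date(c("2020-01-01", "2020-01-02", "2020-01-04"))
missing_dates(x)   # expected: "2020-01-03"
n_missing_dates(x) # expected: 1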
Fast grouped quantile summaries. collapse and data.table are used for the calculations.

q_summarise(
  data,
  ...,
  probs = seq(0, 1, 0.25),
  type = 7,
  pivot = c("wide", "long"),
  na.rm = TRUE,
  sort = df_group_by_order_default(data),
  .by = NULL,
  .cols = NULL
)
data: A data frame.
...: Variables used to calculate quantiles for. Tidy data-masking applies.
probs: Quantile probabilities.
type: An integer from 5-9 specifying which quantile algorithm to use. See ?stats::quantile.
pivot: Should data be pivoted wide or long? Default is "wide".
na.rm: Should NA values be removed? Default is TRUE.
sort: Should groups be sorted? Default is df_group_by_order_default(data).
.by: (Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
.cols: (Optional) alternative to ... that accepts a named character vector or numeric vector of columns.

A data.table containing the quantile values for each group.
library(timeplyr)
library(dplyr)
# Standard quantiles
iris %>% q_summarise(Sepal.Length)
# Quantiles by species
iris %>% q_summarise(Sepal.Length, .by = Species)
# Quantiles by species across multiple columns
iris %>% q_summarise(Sepal.Length, Sepal.Width,
                     probs = c(0, 1), .by = Species)
# Long format if one desires, useful for ggplot2
iris %>% q_summarise(Sepal.Length, pivot = "long", .by = Species)
# Example with lots of groups
set.seed(20230606)
df <- data.frame(x = rnorm(10^5),
                 g = sample.int(10^5, replace = TRUE))
q_summarise(df, x, .by = g, sort = FALSE)
One can set global options to be used in timeplyr. These options include:

time_type - Controls whether to use periods, durations or to decide automatically.
roll_month - Controls how to roll forward or backward impossible calendar days.
roll_dst - Controls how to roll forward or backward impossible date-times.
interval_style - Controls how time_interval objects are formatted.
interval_sub_formatter - A function to format the start and end times of a time_interval.
use_intervals - Controls whether time_intervals are returned whenever dates or date-times are aggregated. If this is FALSE, the start time (or left-hand side) is always returned.

reset_timeplyr_options()
reset_timeplyr_options() resets the timeplyr global options (prefixed with "timeplyr."): time_type, roll_month, roll_dst, interval_style, interval_sub_formatter and use_intervals.
library(timeplyr)
options(timeplyr.interval_style = "start")
getOption("timeplyr.interval_style")
reset_timeplyr_options()
getOption("timeplyr.interval_style")
Inspired by 'collapse', roll_lag and roll_diff operate similarly to flag and fdiff.

roll_lag(x, n = 1L, ...)
## Default S3 method:
roll_lag(x, n = 1L, g = NULL, fill = NULL, ...)
## S3 method for class 'ts'
roll_lag(x, n = 1L, g = NULL, fill = NULL, ...)
## S3 method for class 'zoo'
roll_lag(x, n = 1L, g = NULL, fill = NULL, ...)
roll_diff(x, n = 1L, ...)
## Default S3 method:
roll_diff(x, n = 1L, g = NULL, fill = NULL, differences = 1L, ...)
## S3 method for class 'ts'
roll_diff(x, n = 1L, g = NULL, fill = NULL, differences = 1L, ...)
## S3 method for class 'zoo'
roll_diff(x, n = 1L, g = NULL, fill = NULL, differences = 1L, ...)
diff_(
  x,
  n = 1L,
  differences = 1L,
  order = NULL,
  run_lengths = NULL,
  fill = NULL
)
x: A vector or data frame.
n: Lag. This will be recycled to match the length of x and can be negative.
...: Arguments passed onto the appropriate method.
g: Grouping vector. This can be a vector, data frame or GRP object.
fill: Value used to fill the first n lagged values. Default is NULL (NA).
differences: Number indicating the number of times to recursively apply the differencing algorithm. If differences > 1, the differences are computed recursively.
order: Optionally specify an ordering with which to apply the lags/differences. This is useful for example when applying lags chronologically using an unsorted time variable.
run_lengths: Optional integer vector of run lengths that defines the size of each lag run. For example, supplying c(3, 4) treats the first 3 elements and the next 4 elements as separate runs.
While these may not be as fast as the 'collapse' equivalents, they are adequately fast and efficient. A key difference between roll_lag and flag is that g does not need to be sorted for the result to be correct. Furthermore, a vector of lags can be supplied for a custom rolling lag.

roll_diff() silently returns NA when there is integer overflow. Both roll_lag() and roll_diff() apply recursively to list elements.

A vector the same length as x.
library(timeplyr)
x <- 1:10
roll_lag(x) # Lag
roll_lag(x, -1) # Lead
roll_diff(x) # Lag diff
roll_diff(x, -1) # Lead diff
# Using cheapr::lag_sequence()
# Differences lagged at 5, first 5 differences are compared to x[1]
roll_diff(x, cheapr::lag_sequence(length(x), 5, partial = TRUE))
# Like diff() but x/y instead of x-y
quotient <- function(x, n = 1L){
  x / roll_lag(x, n)
}
# People often call this a growth rate
# but it's just a percentage difference
# See ?roll_growth_rate for growth rate calculations
quotient(1:10)
NA fill

A fast and efficient by-group method for "last-observation-carried-forward" NA filling.

roll_na_fill(x, g = NULL, fill_limit = Inf)
.roll_na_fill(x, fill_limit = Inf)
x: A vector.
g: An object used for grouping x. This may be a vector or data frame, for example.
fill_limit: (Optional) maximum number of consecutive NAs to fill per run of NAs. Default is Inf.
When supplying groups using g, this method uses radixorder(g) to specify how to loop through x, making this extremely efficient. When x contains zero or all NA values, then x is returned with no copy made.

.roll_na_fill() is the same as roll_na_fill() but without a g argument, and it performs no sanity checks. It is passed straight to C++, which makes it efficient for loops.

A filled vector of x the same length as x.
library(timeplyr)
library(dplyr)
library(data.table)
words <- do.call(paste0, do.call(expand.grid, rep(list(letters), 3)))
groups <- sample(words, size = 10^5, replace = TRUE)
x <- sample.int(10^2, 10^5, TRUE)
x[sample.int(10^5, 10^4)] <- NA
dt <- data.table(x, groups)
filled <- roll_na_fill(x, groups)
library(zoo)
# Summary
# Latest version of vctrs with their vec_fill_missing
# is the fastest but not most memory efficient
# For low repetitions and large vectors, data.table is best
# For large numbers of repetitions (groups) and data
# that is sorted by groups, timeplyr is fastest
# No groups
bench::mark(
  e1 = dt[, filled1 := timeplyr::roll_na_fill(x)][]$filled1,
  e2 = dt[, filled2 := data.table::nafill(x, type = "locf")][]$filled2,
  e3 = dt[, filled3 := vctrs::vec_fill_missing(x)][]$filled3,
  e4 = dt[, filled4 := zoo::na.locf0(x)][]$filled4,
  e5 = dt[, filled5 := timeplyr::.roll_na_fill(x)][]$filled5
)
# With group
bench::mark(
  e1 = dt[, filled1 := timeplyr::roll_na_fill(x, groups)][]$filled1,
  e2 = dt[, filled2 := data.table::nafill(x, type = "locf"), by = groups][]$filled2,
  e3 = dt[, filled3 := vctrs::vec_fill_missing(x), by = groups][]$filled3,
  e4 = dt[, filled4 := timeplyr::.roll_na_fill(x), by = groups][]$filled4
)
# Data sorted by groups
setkey(dt, groups)
bench::mark(
  e1 = dt[, filled1 := timeplyr::roll_na_fill(x, groups)][]$filled1,
  e2 = dt[, filled2 := data.table::nafill(x, type = "locf"), by = groups][]$filled2,
  e3 = dt[, filled3 := vctrs::vec_fill_missing(x), by = groups][]$filled3,
  e4 = dt[, filled4 := timeplyr::.roll_na_fill(x), by = groups][]$filled4
)
An efficient method for rolling sum, mean and growth rate for many groups.
roll_sum( x, window = Inf, g = NULL, partial = TRUE, weights = NULL, na.rm = TRUE, ... ) roll_mean( x, window = Inf, g = NULL, partial = TRUE, weights = NULL, na.rm = TRUE, ... ) roll_geometric_mean( x, window = Inf, g = NULL, partial = TRUE, weights = NULL, na.rm = TRUE, ... ) roll_harmonic_mean( x, window = Inf, g = NULL, partial = TRUE, weights = NULL, na.rm = TRUE, ... ) roll_growth_rate( x, window = Inf, g = NULL, partial = TRUE, na.rm = FALSE, log = FALSE, inf_fill = NULL )
x |
Numeric vector, data frame, or list. |
window |
Rolling window size, default is Inf. |
g |
Grouping object passed directly to collapse::GRP(). |
partial |
Should calculations be done using partial windows?
Default is TRUE. |
weights |
Importance weights. Must be the same length as x. Currently, no normalisation of weights occurs. |
na.rm |
Should missing values be removed for the calculation?
The default is TRUE (FALSE for roll_growth_rate). |
... |
Additional arguments passed to |
log |
For roll_growth_rate: if TRUE, growth rates are calculated on the log scale. Default is FALSE. |
inf_fill |
For roll_growth_rate: an optional value with which to replace infinite growth rates. Default is NULL. |
roll_sum and roll_mean support parallel computations when x is a data frame of multiple columns. roll_geometric_mean and roll_harmonic_mean are convenience functions that utilise roll_mean. roll_growth_rate calculates the rate of percentage change per unit time on a rolling basis.
A numeric vector the same length as x when x is a vector, or a list when x is a data.frame.
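Since the examples below don't touch roll_growth_rate, here is a minimal sketch of the "percentage change per unit time" idea (illustrative only; exact outputs depend on the implementation's growth-rate definition):

library(timeplyr)
x <- c(10, 20, 40, 80) # doubles at every step
# Cumulative growth rate: the average multiplicative change per step
# since the start of the (here unbounded) window
roll_growth_rate(x)
# Growth rate over a rolling window of 2 observations
roll_growth_rate(x, window = 2)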
library(timeplyr)
x <- 1:10
roll_sum(x) # Simple rolling total
roll_mean(x) # Simple moving average
roll_sum(x, window = 3)
roll_mean(x, window = 3)
roll_sum(x, window = 3, partial = FALSE)
roll_mean(x, window = 3, partial = FALSE)

# Plot of expected value of 'coin toss' over many flips
set.seed(42)
x <- sample(c(1, 0), 10^3, replace = TRUE)
ev <- roll_mean(x)
plot(ev)
abline(h = 0.5, lty = 2)

all.equal(roll_sum(iris$Sepal.Length, g = iris$Species),
          ave(iris$Sepal.Length, iris$Species, FUN = cumsum))

# The below is run using parallel computations where applicable
roll_sum(iris[, 1:4], window = 7, g = iris$Species)

library(data.table)
library(bench)
df <- data.table(g = sample.int(10^4, 10^5, TRUE),
                 x = rnorm(10^5))
mark(e1 = df[, mean := frollmean(x, n = 7, align = "right", na.rm = FALSE), by = "g"]$mean,
     e2 = df[, mean := roll_mean(x, window = 7, g = get("g"), partial = FALSE, na.rm = FALSE)]$mean)
collapse and data.table are used for the calculations.
stat_summarise(
  data,
  ...,
  stat = .stat_fns[1:3],
  q_probs = NULL,
  na.rm = TRUE,
  sort = df_group_by_order_default(data),
  .count_name = NULL,
  .names = NULL,
  .by = NULL,
  .cols = NULL,
  inform_stats = TRUE,
  as_tbl = FALSE
)

.stat_fns
data |
A data frame. |
... |
Variables to apply the statistical functions to. Tidy data-masking applies. |
stat |
A character vector of statistical summaries to apply.
This can be one or more of the statistics listed in .stat_fns. |
q_probs |
(Optional) Quantile probabilities.
If supplied, quantile summaries of the chosen variables are additionally computed. |
na.rm |
Should NA values be removed? Default is TRUE. |
sort |
Should groups be sorted? Default is df_group_by_order_default(data). |
.count_name |
Name of count column, default is "n". |
.names |
An optional glue specification passed to |
.by |
(Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
.cols |
(Optional) alternative to ... that accepts a named character vector or numeric vector of columns. |
inform_stats |
Should available stat functions be displayed at the start of each session? Default is TRUE. |
as_tbl |
Should the result be a tibble? Default is FALSE. |
.stat_fns is an object of class character of length 14.
stat_summarise() can apply multiple functions to multiple variables. stat_summarise() is equivalent to data %>% group_by(...) %>% summarise(across(..., list(...))) but is faster and more efficient, and accepts a limited set of statistical functions.
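As a rough sketch of that equivalence (a hedged illustration; the exact output column names and the data.table return class of stat_summarise() may differ from the dplyr result):

library(timeplyr)
library(dplyr)
# stat_summarise() with a single stat...
stat_summarise(iris, Sepal.Length, stat = "mean", .by = Species)
# ...should broadly match the dplyr equivalent
iris %>%
  group_by(Species) %>%
  summarise(mean = mean(Sepal.Length))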
A summary data.table containing the summary values for each group.
library(timeplyr)
library(dplyr)
stat_df <- iris %>%
  stat_summarise(Sepal.Length, .by = Species)
# Join quantile info too
q_df <- iris %>%
  q_summarise(Sepal.Length, .by = Species)
summary_df <- left_join(stat_df, q_df, by = "Species")
summary_df

# Multiple cols
iris %>%
  group_by(Species) %>%
  stat_summarise(across(contains("Width")),
                 stat = c("min", "max", "mean", "sd"))
Aggregate time to a higher unit for possibly many groups with respect to a time index.
time_aggregate(
  x,
  time_by = NULL,
  from = NULL,
  to = NULL,
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  as_interval = getOption("timeplyr.use_intervals", TRUE)
)
x |
Time vector. |
time_by |
Time unit.
|
from |
Start. |
to |
End. |
time_type |
If "auto", |
roll_month |
Control how impossible dates are handled when month or year arithmetic is involved. |
roll_dst |
See |
time_floor |
Should from be floored to the nearest unit specified by time_by? Default is FALSE. |
week_start |
day on which week starts following ISO conventions - 1
means Monday (default), 7 means Sunday.
This is only used when |
as_interval |
Should the result be a time_interval? Default is TRUE (controlled by the "timeplyr.use_intervals" option). |
time_aggregate aggregates time using distinct moving time range blocks of a specified time unit. The actual calculation is extremely simple and essentially requires a subtraction, a rounding and an addition. To perform a by-group time aggregation, simply supply collapse::fmin(x, g = groups, TRA = "replace_fill") as the from argument.
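As a rough illustration of that subtract-round-add idea on a plain numeric vector (a hypothetical sketch only; the real implementation also handles dates, datetimes and calendar arithmetic):

# agg_sketch is a hypothetical helper, not part of timeplyr
agg_sketch <- function(x, from, width) {
  from + floor((x - from) / width) * width
}
agg_sketch(c(1, 4, 7, 12), from = 1, width = 5) # 1 1 6 11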
A time_interval.
library(timeplyr)
library(nycflights13)
library(lubridate)
library(dplyr)
sunique <- function(x) sort(unique(x))

hours <- sunique(flights$time_hour)
days <- as_date(hours)

# Aggregate by week or any time unit easily
sunique(time_aggregate(hours, "week"))
sunique(time_aggregate(hours, ddays(14)))
sunique(time_aggregate(hours, "month"))
sunique(time_aggregate(days, "month"))

# Left aligned
sunique(time_aggregate(days, "quarter"))

# Very fast by group aggregation
start <- collapse::fmin(flights$time_hour, g = flights$tailnum, TRA = "replace_fill")
flights %>%
  mutate(start = collapse::fmin(time_hour, g = list(origin, dest), TRA = "replace_fill")) %>%
  mutate(week = time_aggregate(time_hour, dweeks(1), from = start)) %>%
  select(origin, dest, time_hour, week)
time_by groups a time variable by a specified time unit, for example "days" or "weeks". It can be used exactly like dplyr::group_by.
time_by(
  data,
  time,
  time_by = NULL,
  from = NULL,
  to = NULL,
  .name = paste0("time_intv_", time_by_pretty(time_by, "_")),
  .add = FALSE,
  time_type = getOption("timeplyr.time_type", "auto"),
  as_interval = getOption("timeplyr.use_intervals", TRUE),
  .time_by_group = TRUE
)

time_by_span(x)

time_by_var(x)

time_by_units(x)
data |
A data frame. |
time |
Time variable (data-masking). |
time_by |
Time unit.
|
from |
(Optional) Start time. |
to |
(Optional) end time. |
.name |
An optional glue specification passed to |
.add |
Should the time groups be added to existing groups?
Default is FALSE. |
time_type |
If "auto", |
as_interval |
Should the time variable be a time_interval? Default is TRUE. |
.time_by_group |
Should the time aggregations be built on a
group-by-group basis (the default), or should the time variable be aggregated
using the full data? If done by group, different groups may contain
different time sequences. This only applies when |
x |
A time_tbl_df. |
A time_tbl_df which for practical purposes can be treated the same way as a dplyr grouped_df.
library(dplyr)
library(timeplyr)
library(nycflights13)
library(lubridate)

# Basic usage
hourly_flights <- flights %>%
  time_by(time_hour) # Detects time granularity
hourly_flights
time_by_span(hourly_flights)

monthly_flights <- flights %>%
  time_by(time_hour, "month")
weekly_flights <- flights %>%
  time_by(time_hour, "week", from = floor_date(min(time_hour), "week"))

monthly_flights %>%
  count()

weekly_flights %>%
  summarise(n = n(), arr_delay = mean(arr_delay, na.rm = TRUE))

# To aggregate multiple variables, use time_aggregate
flights %>%
  select(time_hour) %>%
  mutate(across(everything(), \(x) time_aggregate(x, time_by = "weeks"))) %>%
  count(time_hour)
time_count is deprecated.
time_count(
  data,
  time = NULL,
  ...,
  time_by = NULL,
  from = NULL,
  to = NULL,
  .name = "{.col}",
  complete = FALSE,
  wt = NULL,
  name = NULL,
  sort = FALSE,
  .by = NULL,
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA"),
  as_interval = getOption("timeplyr.use_intervals", TRUE)
)
data |
Deprecated. |
time |
Deprecated. |
... |
Deprecated. |
time_by |
Deprecated. |
from |
Deprecated. |
to |
Deprecated. |
.name |
Deprecated. |
complete |
Deprecated. |
wt |
Deprecated. |
name |
Deprecated. |
sort |
Deprecated. |
.by |
Deprecated. |
time_floor |
Deprecated. |
week_start |
Deprecated. |
time_type |
Deprecated. |
roll_month |
Deprecated. |
roll_dst |
Deprecated. |
as_interval |
Deprecated. |
Useful functions, especially when plotting time-series data. time_cut makes approximately n groups of equal time range. It prioritises the highest time unit possible, making axes look less cluttered and thus prettier. time_breaks returns only the breaks. time_cut_width cuts the time vector into groups of equal width, e.g. a day.
time_cut(
  x,
  n = 5,
  time_by = NULL,
  from = NULL,
  to = NULL,
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA"),
  as_interval = getOption("timeplyr.use_intervals", TRUE)
)

time_breaks(
  x,
  n = 5,
  time_by = NULL,
  from = NULL,
  to = NULL,
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)

time_cut_width(
  x,
  time_by = NULL,
  from = NULL,
  as_interval = getOption("timeplyr.use_intervals", TRUE)
)
x |
Time variable. |
n |
Number of breaks. |
time_by |
Time unit.
|
from |
Time series start date. |
to |
Time series end date. |
time_floor |
Logical. Should the initial date/datetime be floored before building the sequence? |
week_start |
day on which week starts following ISO conventions - 1
means Monday (default), 7 means Sunday.
This is only used when |
time_type |
If "auto", |
roll_month |
Control how impossible dates are handled when
month or year arithmetic is involved.
Options are "preday", "boundary", "postday", "full" and "NA".
See |
roll_dst |
See |
as_interval |
Should the result be a time_interval? Default is TRUE. |
To retrieve regular time breaks that simply span the range of x, use time_seq() or time_aggregate(). This can also be achieved in time_cut() by supplying n = Inf.

By default time_cut() will try to find the prettiest way of cutting the interval by trying to cut the date/date-times into groups of the highest possible time units, starting at years and ending at milliseconds.

When x is a numeric vector, time_cut will behave much like a standard numeric binning function (e.g. base cut), except for 3 things:

The intervals are all right-open and of equal width.

The left value of the leftmost interval is always min(x).

Up to n breaks are created, i.e. <= n breaks. This is to prioritise pretty breaks.

time_cut is a generalisation of time_summarisev such that the below identity should always hold:

identical(time_cut(x, n = Inf, as_factor = FALSE), time_summarisev(x))

Or also:

breaks <- time_breaks(x, n = Inf)
identical(breaks[unclass(time_cut(x, n = Inf))], time_summarisev(x))
time_breaks returns a vector of breaks. time_cut returns either a vector or a time_interval. time_cut_width cuts the time vector into groups of equal width, e.g. a day, and returns the same object as time_cut. This is analogous to ggplot2::cut_width but the intervals are all right-open.
library(timeplyr)
library(lubridate)
library(ggplot2)
library(dplyr)

time_cut(1:10, n = 5)

# Easily create custom time breaks
df <- nycflights13::flights %>%
  fslice_sample(n = 10, seed = 8192821) %>%
  select(time_hour) %>%
  farrange(time_hour) %>%
  mutate(date = as_date(time_hour))

# time_cut() and time_breaks() automatically find a
# suitable way to cut the data
options(timeplyr.use_intervals = TRUE)
time_cut(df$date)

# Works with datetimes as well
time_cut(df$time_hour, n = 5) # <= 5 breaks

# Custom formatting
options(timeplyr.interval_sub_formatter =
          function(x) format(x, format = "%Y %b"))
time_cut(df$date, time_by = "month")

# Just the breaks
time_breaks(df$date, n = 5, time_by = "month")

cut_dates <- time_cut(df$date)
date_breaks <- time_breaks(df$date)

# When n = Inf and as_factor = FALSE, it should be equivalent to using
# time_aggregate or time_summarisev
identical(time_cut(df$date, n = Inf, time_by = "month"),
          time_summarisev(df$date, time_by = "month"))
identical(time_summarisev(df$date, time_by = "month"),
          time_aggregate(df$date, time_by = "month"))

# To get exact breaks at regular intervals, use time_expandv
weekly_breaks <- time_expandv(df$date,
                              time_by = "5 weeks",
                              week_start = 1, # Monday
                              time_floor = TRUE)
weekly_labels <- format(weekly_breaks, "%b-%d")
df %>%
  time_by(date, time_by = "week", .name = "date") %>%
  count() %>%
  mutate(date = interval_start(date)) %>%
  ggplot(aes(x = date, y = n)) +
  geom_bar(stat = "identity") +
  scale_x_date(breaks = weekly_breaks, labels = weekly_labels)

reset_timeplyr_options()
The time difference between 2 date or date-time vectors.
time_diff(
  x,
  y,
  time_by = 1L,
  time_type = getOption("timeplyr.time_type", "auto")
)
x |
Start date or datetime. |
y |
End date or datetime. |
time_by |
Must be one of the three (default is 1): a unit string (e.g. "days" or "3 weeks"), a lubridate duration or period (e.g. weeks(1)), or a named list of length one (e.g. list("days" = 7)). |
time_type |
Time difference type: "auto", "duration" or "period". |
When time_by is a numeric vector, e.g. time_by = 1, base arithmetic using base::`-` is used, otherwise 'lubridate' style durations and periods are used. Some more exotic time units such as quarters, fortnights, et cetera can be specified. See .time_units for more choices.
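For example, with a numeric time_by the result is plain arithmetic (a small sketch of the behaviour described above):

library(timeplyr)
time_diff(1, 10, time_by = 3)       # (10 - 1) / 3 = 3
time_diff(0, c(5, 10), time_by = 5) # 1 2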
A numeric vector recycled to the length of max(length(x), length(y)).
library(timeplyr)
library(lubridate)
time_diff(today(), today() + days(10), time_by = "days")
time_diff(today(), today() + days((0:3) * 7), time_by = weeks(1))
time_diff(today(), today() + days(100), time_by = list("days" = 1:100))
time_diff(1, 1 + 0:100, time_by = 3)

library(nycflights13)
library(bench)

# Period differences are much faster
# check = FALSE because the results are fractionally different.
# lubridate:::adjust_estimate likely has a typo in the first while loop
mark(timeplyr = time_diff(flights$time_hour, today(), "weeks", time_type = "period"),
     lubridate = interval(flights$time_hour, today()) / weeks(1),
     check = FALSE)
Calculate how much time has passed
on a rolling or cumulative basis.
time_elapsed(
  x,
  time_by = NULL,
  g = NULL,
  time_type = getOption("timeplyr.time_type", "auto"),
  rolling = TRUE,
  fill = NA,
  na_skip = TRUE
)
x |
Time variable. |
time_by |
Must be one of the three: a unit string (e.g. "days" or "3 weeks"), a lubridate duration or period (e.g. weeks(1)), or a named list of length one (e.g. list("days" = 7)). |
g |
Object to be used for grouping x, passed directly to collapse::GRP(). |
time_type |
Time type, either "auto", "duration" or "period".
With larger data, it is recommended to use |
rolling |
If TRUE (the default), time differences are calculated on a rolling basis between successive values; if FALSE, they are calculated relative to the first (non-NA) value of the series. |
fill |
When rolling = TRUE, the value used to fill the first elapsed-time value. Default is NA. |
na_skip |
Should NA values be skipped? Default is TRUE. |
time_elapsed() is quite efficient when there are many groups, especially if your data is sorted in order of those groups. In the case that g is supplied, it is most efficient when your data is sorted by g.

When na_skip is TRUE and rolling is also TRUE, NA values are simply skipped and hence the time differences between the current value and the previous non-NA value are calculated. For example, c(3, 4, 6, NA, NA, 9) becomes c(NA, 1, 2, NA, NA, 3).

When na_skip is TRUE and rolling is FALSE, time differences between the current value and the first non-NA value of the series are calculated. For example, c(NA, NA, 3, 4, 6, NA, 8) becomes c(NA, NA, 0, 1, 3, NA, 5).
A numeric vector the same length as x.
library(timeplyr)
library(dplyr)
library(lubridate)
x <- time_seq(today(), length.out = 25, time_by = "3 days")
time_elapsed(x)
time_elapsed(x, rolling = FALSE, time_by = "day")

# Grouped example
set.seed(99)
# ~ 100k groups, 1m rows
x <- sample(time_seq_v2(20, today(), "day"), 10^6, TRUE)
g <- sample.int(10^5, 10^6, TRUE)
time_elapsed(x, time_by = "day", g = g)
This function assigns episodes to events based on a pre-defined threshold of a chosen time unit.
time_episodes(
  data,
  time,
  time_by = NULL,
  window = 1,
  roll_episode = TRUE,
  switch_on_boundary = TRUE,
  fill = 0,
  .add = FALSE,
  event = NULL,
  time_type = getOption("timeplyr.time_type", "auto"),
  .by = NULL
)
data |
A data frame. |
time |
Date or datetime variable to use for the episode calculation.
Supply the variable using tidy data-masking. |
time_by |
Time units used to calculate episode flags.
If time_by is NULL, a heuristic will try to estimate the highest-order time unit associated with the time variable. |
window |
Single number defining the episode threshold.
When the time elapsed since the last event exceeds this threshold (in time_by units), a new episode is flagged. |
roll_episode |
Logical.
Should episodes be calculated using a rolling or fixed window?
If TRUE (the default), a rolling window is used, comparing the time since the most recent event against the threshold; otherwise the time since the first event of the current episode is used. |
switch_on_boundary |
When an exact amount of time (specified in time_by) has passed, should a new episode be flagged? The default is TRUE. |
fill |
Value to fill the first time elapsed value. Only applicable when roll_episode = TRUE. Default is 0. |
.add |
Should episodic variables be added to the data? |
event |
(Optional) List that encodes which rows are events,
and which aren't.
By default, all rows are classed as events. |
time_type |
Time type, either "auto", "duration" or "period".
With larger data, it is recommended to use |
.by |
(Optional). A selection of columns to group by for this operation.
Columns are specified using tidy-select. |
time_episodes() calculates the time elapsed (rolling or fixed) between successive events, and flags these events as episodes or not based on how much time has passed.

An example of episodic analysis can include disease infections over time. In this example, a positive test result represents an event and a new infection represents a new episode. It is assumed that after a pre-determined amount of time, a positive result represents a new episode of infection.

To perform simple time-since-event analysis, where one is not interested in episodes, simply use time_elapsed() instead.

To find implicit missing gaps in time, set window to 1 and switch_on_boundary to FALSE. Any event classified as an episode in this scenario is an event following a gap in time (see the small sketch after the list of key variables below).

The data are always sorted before calculation and then sorted back to the input order.
4 key variables will be calculated:

ep_id - An integer variable signifying which episode each event belongs to. Non-events are assigned NA. ep_id is an increasing integer starting at 1. In the infections scenario, 1 are positives within the first episode of infection, 2 are positives within the second episode of infection, and so on.

ep_id_new - An integer variable signifying the first instance of each new episode. This is an increasing integer where 0 signifies within-episode observations and >= 1 signifies the first instance of the respective episode.

t_elapsed - The time elapsed since the last event. When roll_episode = FALSE, this becomes the time elapsed since the first event of the current episode. Time units are specified in the time_by argument.

ep_start - Start date/datetime of the episode.
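Here is that gap-finding idea sketched on a small hypothetical dataset (window = 1 with switch_on_boundary = FALSE, so only events following a jump in time should start a new episode):

library(timeplyr)
library(dplyr)
df <- dplyr::tibble(date = as.Date("2020-01-01") + c(0, 1, 2, 5, 6))
df %>%
  time_episodes(date, time_by = "days", window = 1,
                switch_on_boundary = FALSE)
# The event on day 5 follows a 3-day jump in time,
# so it should be flagged as the start of a new episode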
data.table and collapse are used for speed and efficiency.
A data.frame in the same order as it was given.
library(timeplyr)
library(dplyr)
library(nycflights13)
library(lubridate)
library(ggplot2)

# Say we want to flag origin-destination pairs
# that haven't seen departures or arrivals for a week
events <- flights %>%
  mutate(date = as_date(time_hour)) %>%
  group_by(origin, dest) %>%
  time_episodes(date, time_by = "week", window = 1)

# The pooled average time between flights of a specific origin and destination
# is ~ 5.2 hours
# This average is a weighted average of average time between events
# Weighted by the frequency of origin-destination groups (pairs)
# It can be calculated like so:
# flights %>%
#   arrange(origin, dest, time_hour) %>%
#   group_by(origin, dest) %>%
#   mutate(time_diff = time_diff(lag(time_hour), time_hour, "hours")) %>%
#   summarise(n = n(),
#             mean = mean(time_diff, na.rm = TRUE)) %>%
#   ungroup() %>%
#   summarise(pooled_mean = weighted.mean(mean, n, na.rm = TRUE))

events

episodes <- events %>%
  filter(ep_id_new > 1)
nrow(fdistinct(episodes, origin, dest)) # 55 origin-destinations

# As expected summer months saw the least number of
# dry-periods
episodes %>%
  ungroup() %>%
  time_by(ep_start, time_by = "week", .name = "ep_start",
          as_interval = FALSE) %>%
  count() %>%
  ggplot(aes(x = ep_start, y = n)) +
  geom_bar(stat = "identity")
A time-based extension to tidyr::complete().
time_expand(
  data,
  time = NULL,
  ...,
  .by = NULL,
  time_by = NULL,
  from = NULL,
  to = NULL,
  time_type = getOption("timeplyr.time_type", "auto"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  expand_type = c("nesting", "crossing"),
  sort = TRUE,
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)

time_complete(
  data,
  time = NULL,
  ...,
  .by = NULL,
  time_by = NULL,
  from = NULL,
  to = NULL,
  time_type = getOption("timeplyr.time_type", "auto"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  expand_type = c("nesting", "crossing"),
  sort = TRUE,
  fill = NA,
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)
data |
A data frame. |
time |
Time variable. |
... |
Groups to expand. |
.by |
(Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select. |
time_by |
Time unit.
|
from |
Time series start date. |
to |
Time series end date. |
time_type |
If "auto", |
time_floor |
Should from be floored to the nearest unit specified by time_by? Default is FALSE. |
week_start |
day on which week starts following ISO conventions - 1
means Monday (default), 7 means Sunday.
This is only used when |
expand_type |
Type of time expansion to use where "nesting" finds combinations already present in the data, "crossing" finds all combinations of values in the group variables. |
sort |
Logical. If TRUE (the default), the expanded/completed result is sorted. |
roll_month |
Control how impossible dates are handled when
month or year arithmetic is involved.
Options are "preday", "boundary", "postday", "full" and "NA".
See |
roll_dst |
See |
fill |
A named list containing value-name pairs to fill the named implicit missing values. |
This works much the same as tidyr::complete(), except that you can supply an additional time argument to allow for filling in time gaps, expansion of time, as well as aggregating time to a higher unit. lubridate is used for handling time, while data.table and collapse are used for the data frame expansion.

At the moment, within-group combinations are ignored. This means that when expand_type = "nesting", existing combinations of supplied groups across the entire dataset are used, and when expand_type = "crossing", all possible combinations of supplied groups across the entire dataset are used as well.
A data.frame of expanded time by or across groups.
library(timeplyr)
library(dplyr)
library(lubridate)
library(nycflights13)
x <- flights$time_hour
time_num_gaps(x) # Missing hours

flights_count <- flights %>%
  fcount(time_hour)

# Fill in missing hours
flights_count %>%
  time_complete(time = time_hour)

# You can specify units too
flights_count %>%
  time_complete(time = time_hour, time_by = "hours")
flights_count %>%
  time_complete(time = as_date(time_hour), time_by = "days") # Nothing to complete here

# Where time_expand() and time_complete() really shine is how fast they are with groups
flights %>%
  group_by(origin, dest) %>%
  time_expand(time = time_hour, time_by = dweeks(1))
These are atomic vector-based versions of the tidy equivalents, which all have a "v" suffix to denote this. These are more geared towards programmers and allow for working with date and datetime vectors.
time_expandv(
  x,
  time_by = NULL,
  from = NULL,
  to = NULL,
  g = NULL,
  use.g.names = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)

time_span(
  x,
  time_by = NULL,
  from = NULL,
  to = NULL,
  g = NULL,
  use.g.names = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)

time_completev(
  x,
  time_by = NULL,
  from = NULL,
  to = NULL,
  sort = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)

time_summarisev(
  x,
  time_by = NULL,
  from = NULL,
  to = NULL,
  sort = FALSE,
  unique = FALSE,
  time_type = getOption("timeplyr.time_type", "auto"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA"),
  as_interval = getOption("timeplyr.use_intervals", TRUE)
)

time_countv(
  x,
  time_by = NULL,
  from = NULL,
  to = NULL,
  sort = TRUE,
  unique = TRUE,
  complete = FALSE,
  time_type = getOption("timeplyr.time_type", "auto"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA"),
  as_interval = getOption("timeplyr.use_intervals", TRUE)
)

time_span_size(
  x,
  time_by = NULL,
  from = NULL,
  to = NULL,
  g = NULL,
  use.g.names = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1)
)
x |
Time variable. |
time_by |
Time unit.
|
from |
Time series start date. |
to |
Time series end date. |
g |
Grouping object passed directly to collapse::GRP(). |
use.g.names |
Should the result include group names?
Default is TRUE. |
time_type |
If "auto", |
time_floor |
Should from be floored to the nearest unit specified by time_by? Default is FALSE. |
week_start |
day on which week starts following ISO conventions - 1
means Monday (default), 7 means Sunday.
This is only used when |
roll_month |
Control how impossible dates are handled when
month or year arithmetic is involved.
Options are "preday", "boundary", "postday", "full" and "NA".
See |
roll_dst |
See |
sort |
Should the output be sorted? Default is TRUE, except for time_summarisev where it is FALSE. |
unique |
Should the result be unique or match the length of the vector?
Default is FALSE for time_summarisev and TRUE for time_countv. |
as_interval |
Should the result be a time_interval? Default is TRUE. |
complete |
Logical. If TRUE, implicit gaps in time are filled before counting. Default is FALSE. |
Vectors (typically the same class as x) of varying lengths depending on the arguments supplied. time_countv() returns a tibble.
library(timeplyr)
library(dplyr)
library(lubridate)
library(nycflights13)

x <- unique(flights$time_hour)

# Number of missing hours
time_num_gaps(x)
# Same as above
time_span_size(x) - length(unique(x))

# Time sequence that spans the data
length(time_span(x)) # Automatically detects hour granularity
time_span(x, time_by = "month")
time_span(x, time_by = list("quarters" = 1),
          to = today(),
          # Floor start of sequence to nearest month
          time_floor = TRUE)

# Complete missing gaps in time using time_completev
y <- time_completev(x, time_by = "hour")
identical(y[!y %in% x], time_gaps(x))

# Summarise time using time_summarisev
time_summarisev(y, time_by = "quarter")
time_summarisev(y, time_by = "quarter", unique = TRUE)
flights %>%
  fcount(quarter = time_summarisev(time_hour, "quarter"))
# Alternatively
time_countv(flights$time_hour, time_by = "quarter")
# If you want the above as an atomic vector just use tibble::deframe
time_gaps()
checks for implicit missing gaps in time for any
regular date or datetime sequence.
time_gaps(
  x,
  time_by = NULL,
  g = NULL,
  use.g.names = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  check_time_regular = FALSE
)

time_num_gaps(
  x,
  time_by = NULL,
  g = NULL,
  use.g.names = TRUE,
  na.rm = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  check_time_regular = FALSE
)

time_has_gaps(
  x,
  time_by = NULL,
  g = NULL,
  use.g.names = TRUE,
  na.rm = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  check_time_regular = FALSE
)
x |
A date, datetime or numeric vector. |
time_by |
Time unit.
|
g |
Grouping object passed directly to collapse::GRP(). |
use.g.names |
Should the result include group names?
Default is TRUE. |
time_type |
Time type, either "auto", "duration" or "period".
With larger data, it is recommended to use |
check_time_regular |
Should the time vector be
checked to see if it is regular (with or without gaps)?
Default is FALSE. |
na.rm |
Should NA values be removed? Default is TRUE. |
When check_time_regular is TRUE, x is passed to time_is_regular, which checks that the times elapsed between successive values are in increasing order and are whole numbers. For more strict checks, see ?time_is_regular.
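A minimal sketch (the unit increment here is inferred as 1):

library(timeplyr)
x <- c(1, 2, 4, 5) # 3 is implicitly missing
time_has_gaps(x)   # Expected: TRUE
time_num_gaps(x)   # Expected: 1
time_gaps(x)       # Expected: 3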
time_gaps returns a vector of time gaps. time_num_gaps returns the number of time gaps. time_has_gaps returns a logical(1) of whether there are gaps.
library(timeplyr)
library(dplyr)
library(lubridate)
library(nycflights13)

missing_dates(flights$time_hour)
time_has_gaps(flights$time_hour)
time_num_gaps(flights$time_hour)
time_gaps(flights$time_hour)

time_num_gaps(flights$time_hour, g = flights$origin)

# Number of missing hours by origin and dest
flights %>%
  group_by(origin, dest) %>%
  summarise(n_missing = time_num_gaps(time_hour, "hours"))
Fast greatest common divisor of time differences
time_gcd_diff(
  x,
  time_by = 1L,
  time_type = getOption("timeplyr.time_type", "auto"),
  tol = sqrt(.Machine$double.eps)
)
x |
Time variable. |
time_by |
Time unit.
|
time_type |
If "auto", |
tol |
Numeric tolerance for gcd algorithm. |
A list of length 1.
library(timeplyr)
library(lubridate)
library(cppdoubles)

time_gcd_diff(1:10)
time_gcd_diff(seq(0, 1, 0.2))
time_gcd_diff(time_seq(today(), today() + 100, time_by = "3 days"))
time_gcd_diff(time_seq(now(), len = 10^2, time_by = "125 seconds"))

# Monthly gcd using lubridate periods
quarter_seq <- time_seq(today(), len = 24, time_by = months(4))
time_gcd_diff(quarter_seq, time_by = months(1), time_type = "period")
time_gcd_diff(quarter_seq, time_by = "months", time_type = "duration")

# Detects monthly granularity
double_equal(time_gcd_diff(as.vector(time(AirPassengers))), 1/12)
time_ggplot()
is a neat way to quickly
plot aggregate time-series data.
time_ggplot(
  data,
  time,
  value,
  group = NULL,
  facet = FALSE,
  geom = ggplot2::geom_line,
  ...
)
data |
A data frame |
time |
Time variable using tidy data-masking. |
value |
Value variable using tidy data-masking. |
group |
(Optional) Group variable using tidy data-masking. |
facet |
When groups are supplied, should multi-series be
plotted separately or on the same plot?
Default is FALSE. |
geom |
ggplot2 geom to use. Default is ggplot2::geom_line. |
... |
Further arguments passed to the chosen 'geom'. |
A ggplot.
library(dplyr)
library(timeplyr)
library(ggplot2)
library(lubridate)

# It's as easy as this
AirPassengers %>%
  ts_as_tibble() %>%
  time_ggplot(time, value)

# And this
EuStockMarkets %>%
  ts_as_tibble() %>%
  time_ggplot(time, value, group)

# zoo example
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14) - 1
x <- zoo::zoo(rnorm(5), x.Date)
x %>%
  ts_as_tibble() %>%
  time_ggplot(time, value)

# An example using raw data
ebola <- outbreaks::ebola_sim$linelist

# We can build a helper to count and complete
# Using the same time grid
count_and_complete <- function(.data, time, .name,
                               from = NULL, ...,
                               time_by = NULL){
  .data %>%
    time_by(!!dplyr::enquo(time), time_by = time_by,
            .name = .name,
            from = !!dplyr::enquo(from),
            as_interval = FALSE) %>%
    dplyr::count(...) %>%
    dplyr::ungroup() %>%
    time_complete(.data[[.name]], ..., time_by = time_by,
                  fill = list(n = 0))
}
ebola %>%
  count_and_complete(date_of_onset, outcome, time_by = "week",
                     .name = "date_of_onset",
                     from = floor_date(min(date_of_onset), "week")) %>%
  time_ggplot(date_of_onset, n, geom = geom_blank) +
  geom_col(aes(fill = outcome))
Generate a time ID that signifies how many time steps away a time value is from the starting time point; more intuitively, this is the time passed since the first time point.
time_id(
  x,
  time_by = NULL,
  g = NULL,
  na_skip = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  shift = 1L
)
x |
Time variable. |
time_by |
Time unit.
|
g |
Object used for grouping x.
This can for example be a vector or data frame.
It is passed directly to collapse::GRP(). |
na_skip |
Should NA values be skipped? Default is TRUE. |
time_type |
If "auto", |
shift |
Value used to shift the time IDs. Typically this is 1 to ensure the IDs start at 1 but can be 0 or even negative if for example your time values are going backwards in time. |
This is heavily inspired by collapse::timeid but differs in 3 ways:

The time steps need not be the greatest common divisor of successive differences.

The starting time point may not necessarily be the earliest chronologically, and thus time_id can generate negative IDs.

g can be supplied to calculate IDs by group.

time_id(c(3, 2, 1)) is not the same as collapse::timeid(c(3, 2, 1)). In general, time_id(sort(x)) should be equal to collapse::timeid(sort(x)). The time difference GCD is always calculated using all the data and not by-group.
An integer vector the same length as x.
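There is no examples section for time_id, so the following small sketch illustrates the documented relationship with collapse::timeid:

library(timeplyr)
x <- c(3, 2, 1)
time_id(x)          # IDs are relative to the first value...
collapse::timeid(x) # ...so these two generally differ
# Documented to agree in general once sorted:
identical(time_id(sort(x)), collapse::timeid(sort(x)))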
Inspired by both 'lubridate' and 'ivs', time_interval
is a 'vctrs' style
class for right-open intervals that contain a vector of start dates and end dates.
time_interval(start = integer(), end = integer())

is_time_interval(x)
start |
Start time. |
end |
End time. |
x |
A 'time_interval'. |
In the near future, all time aggregated variables will utilise these intervals. One can control the appearance of the intervals through the "timeplyr.interval_style" option. For example:

options(timeplyr.interval_style = "full") - Full interval format.
options(timeplyr.interval_style = "start") - Start time of the interval.
options(timeplyr.interval_style = "end") - End time of the interval.
Representing time using intervals is natural because when one talks about a day or an hour, they are implicitly referring to an interval of time. Even a unit as small as a second is just an interval and therefore base R objects like Dates and POSIXcts are also intervals.
An object of class time_interval. is_time_interval returns a logical of length 1. interval_start returns the start times. interval_end returns the end times. interval_count returns a data frame of unique intervals and their counts.
library(dplyr)
library(timeplyr)
library(lubridate)

x <- 1:10
int <- time_interval(x, 100)
options(timeplyr.interval_style = "full")
int

# Displaying the start or end values of the intervals
format(int, "start")
format(int, "end")

month_start <- floor_date(today(), unit = "months")
month_int <- time_interval(month_start, month_start + months(1))
month_int

# Custom format function for start and end dates
format(month_int, interval_sub_formatter =
         function(x) format(x, format = "%Y/%B"))
format(month_int, interval_style = "start",
       interval_sub_formatter = function(x) format(x, format = "%Y/%B"))

# Advanced formatting

# As shown above, we can specify formatting functions for the dates
# in our intervals
# Sometimes it's useful to set a default function
options(timeplyr.interval_sub_formatter =
          function(x) format(x, format = "%b %Y"))
month_int

# Divide an interval into different time units
time_interval(today(), today() + years(0:10)) / "years"
time_interval(today(), today() + dyears(0:10)) / ddays(365.25)
time_interval(today(), today() + years(0:10)) / "months"
time_interval(today(), today() + years(0:10)) / "weeks"
time_interval(today(), today() + years(0:10)) / "7 days"
time_interval(today(), today() + years(0:10)) / "24 hours"
time_interval(today(), today() + years(0:10)) / "minutes"
time_interval(today(), today() + years(0:10)) / "seconds"
time_interval(today(), today() + years(0:10)) / "milliseconds"

# Cutting Sepal Length into blocks of width 1
int <- time_aggregate(iris$Sepal.Length, time_by = 1, as_interval = TRUE)
int %>%
  interval_count()

reset_timeplyr_options()
This function is a fast way to check if a time vector is a regular sequence, possibly for many groups. Regular in this context means that the lagged time differences are a whole multiple of the specified time unit. This means x can be a regular sequence with or without gaps in time.
time_is_regular(
  x,
  time_by = NULL,
  g = NULL,
  use.g.names = TRUE,
  na.rm = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  allow_gaps = TRUE,
  allow_dups = TRUE
)
x |
A vector. Can be a date, datetime or numeric vector. |
time_by |
Time unit.
|
g |
Grouping object passed directly to collapse::GRP(). |
use.g.names |
Should the result include group names?
Default is TRUE. |
na.rm |
Should NA values be removed? Default is TRUE. |
time_type |
If "auto", |
allow_gaps |
Should gaps be allowed? Default is TRUE. |
allow_dups |
Should duplicates be allowed? Default is TRUE. |
A logical vector the same length as the number of supplied groups.
library(timeplyr)
library(lubridate)
library(dplyr)

x <- 1:5
y <- c(1, 1, 2, 3, 5)

time_is_regular(x)
time_is_regular(y)

increment <- 1

# No duplicates allowed
time_is_regular(x, increment, allow_dups = FALSE)
time_is_regular(y, increment, allow_dups = FALSE)

# No gaps allowed
time_is_regular(x, increment, allow_gaps = FALSE)
time_is_regular(y, increment, allow_gaps = FALSE)

# Grouped
eu_stock <- ts_as_tibble(EuStockMarkets)
eu_stock <- eu_stock %>%
  mutate(date = as_date(date_decimal(time)))

time_is_regular(eu_stock$date, g = eu_stock$group, time_by = 1)

# This makes sense as no trading occurs on weekends and holidays
time_is_regular(eu_stock$date, g = eu_stock$group, time_by = 1,
                allow_gaps = FALSE)
time_roll_sum and time_roll_mean are efficient methods for calculating a rolling sum and mean respectively given many groups and with respect to a date or datetime time index. It is always aligned "right". time_roll_window splits x into windows based on the index. time_roll_window_size returns the window sizes for all indices of x. time_roll_apply is a generic function that applies any function on a rolling basis with respect to a time index. time_roll_growth_rate can efficiently calculate by-group rolling growth rates with respect to a date/datetime index.
time_roll_sum(
  x,
  window = Inf,
  time = seq_along(x),
  weights = NULL,
  g = NULL,
  partial = TRUE,
  close_left_boundary = FALSE,
  na.rm = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA"),
  ...
)

time_roll_mean(
  x,
  window = Inf,
  time = seq_along(x),
  weights = NULL,
  g = NULL,
  partial = TRUE,
  close_left_boundary = FALSE,
  na.rm = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA"),
  ...
)

time_roll_growth_rate(
  x,
  window = Inf,
  time = seq_along(x),
  time_step = NULL,
  g = NULL,
  partial = TRUE,
  close_left_boundary = FALSE,
  na.rm = TRUE,
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)

time_roll_window_size(
  time,
  window = Inf,
  g = NULL,
  partial = TRUE,
  close_left_boundary = FALSE,
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)

time_roll_window(
  x,
  window = Inf,
  time = seq_along(x),
  g = NULL,
  partial = TRUE,
  close_left_boundary = FALSE,
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)

time_roll_apply(
  x,
  window = Inf,
  fun,
  time = seq_along(x),
  g = NULL,
  partial = TRUE,
  unlist = FALSE,
  close_left_boundary = FALSE,
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)
Argument | Description
---|---
x | Numeric vector.
window | Time window size. Default is Inf.
time | (Optional) time index.
weights | Importance weights. Must be the same length as x. Currently, no normalisation of weights occurs.
g | Grouping object passed directly to collapse::GRP(). This can for example be a vector or data frame.
partial | Should calculations be done using partial windows? Default is TRUE.
close_left_boundary | Should the left boundary be closed? Default is FALSE.
na.rm | Should missing values be removed for the calculation? The default is TRUE.
time_type | If "auto", periods are used for days, weeks, months and years, and durations otherwise.
roll_month | Control how impossible dates are handled when month or year arithmetic is involved. Options are "preday", "boundary", "postday", "full" and "NA". See ?timechange::time_add for more details.
roll_dst | See ?timechange::time_add for details.
... | Additional arguments passed to the underlying rolling function.
time_step | An optional but important argument that follows the same input rules as window. It is used to incorporate gaps in time into growth rate calculations (see details).
fun | A function.
unlist | Should the output be unlisted into a vector? Default is FALSE.
It is much faster if your data are already sorted such that !is.unsorted(order(g, x)) is TRUE.

For growth rates across time, one can use time_step to incorporate gaps in time into the calculation. For example, given x <- c(10, 20), t <- c(1, 10) and k <- Inf, time_roll_growth_rate(x, time = t, window = k) returns c(1, 2), whereas time_roll_growth_rate(x, time = t, window = k, time_step = 1) returns c(1, 1.08). The first is a doubling from 10 to 20, whereas the second implies a growth of roughly 8% for each of the 9 time steps from 1 to 10, since 2^(1/9) is about 1.08. This allows us, for example, to calculate daily growth rates over the last x months, even with missing days.
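The worked example above can be run directly; a minimal sketch using the same values from the text:

library(timeplyr)
x <- c(10, 20)
t <- c(1, 10)
# Total growth over the whole window: a doubling from 10 to 20
time_roll_growth_rate(x, time = t, window = Inf)
# Per-step growth rate over the 9 elapsed time steps: 2^(1/9), about 1.08
time_roll_growth_rate(x, time = t, window = Inf, time_step = 1)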
A vector the same length as time.
library(timeplyr)
library(lubridate)
library(dplyr)

time <- time_seq(today(), today() + weeks(3), time_by = "3 days")
set.seed(99)
x <- sample.int(length(time))

roll_mean(x, window = 7)
roll_sum(x, window = 7)

time_roll_mean(x, window = ddays(7), time = time)
time_roll_sum(x, window = days(7), time = time)

# Alternatively and more verbosely
x_chunks <- time_roll_window(x, window = 7, time = time)
x_chunks
vapply(x_chunks, mean, 0)

# Interval (x - 3, x]
time_roll_sum(x, window = ddays(3), time = time)

# An example with an irregular time series
t <- today() + days(sort(sample(1:30, 20, TRUE)))
time_elapsed(t, days(1)) # See the irregular elapsed time
x <- rpois(length(t), 10)

tibble(x, t) %>%
  mutate(sum = time_roll_sum(x, time = t, window = days(3))) %>%
  time_ggplot(t, sum)

### Rolling mean example with many time series

# Sparse time with duplicates
index <- sort(sample(seq(now(), now() + dyears(3), by = "333 hours"),
                     250, TRUE))
x <- matrix(rnorm(length(index) * 10^3),
            ncol = 10^3, nrow = length(index), byrow = FALSE)
zoo_ts <- zoo::zoo(x, order.by = index)

# Normally you might attempt something like this
apply(x, 2, function(x){
  time_roll_mean(x, window = dmonths(1), time = index)
})
# Unfortunately this is too slow and inefficient

# Instead we can pivot it longer and code each series as a separate group
tbl <- ts_as_tibble(zoo_ts)
tbl %>%
  mutate(monthly_mean = time_roll_mean(value, window = dmonths(1),
                                       time = time, g = group))
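time_roll_apply is not covered by the examples above; the following is a minimal sketch under the usage shown, with median as an arbitrary choice of function:

library(timeplyr)
library(lubridate)

time <- time_seq(today(), today() + weeks(3), time_by = "3 days")
set.seed(99)
x <- sample.int(length(time))

# Apply any function (here median, an arbitrary choice) over a
# right-aligned 7-day window, returning a vector via unlist = TRUE
time_roll_apply(x, window = ddays(7), fun = median, time = time,
                unlist = TRUE)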
Time-based version of base::seq()
time_seq(
  from, to, time_by, length.out = NULL,
  time_type = getOption("timeplyr.time_type", "auto"),
  week_start = getOption("lubridate.week.start", 1),
  time_floor = FALSE,
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)

time_seq_sizes(
  from, to, time_by,
  time_type = getOption("timeplyr.time_type", "auto")
)

time_seq_v(
  from, to, time_by,
  time_type = getOption("timeplyr.time_type", "auto"),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1)
)

time_seq_v2(
  sizes, from, time_by,
  time_type = getOption("timeplyr.time_type", "auto"),
  time_floor = FALSE,
  week_start = getOption("lubridate.week.start", 1),
  roll_month = getOption("timeplyr.roll_month", "preday"),
  roll_dst = getOption("timeplyr.roll_dst", "NA")
)
Argument | Description
---|---
from | Start date/datetime of sequence.
to | End date/datetime of sequence.
time_by | Time unit increment, e.g. "day" or "2 weeks", or a named list such as list("days" = 1).
length.out | Length of the sequence.
time_type | If "auto", periods are used for days, weeks, months and years, and durations otherwise.
week_start | Day on which the week starts following ISO conventions: 1 means Monday (default), 7 means Sunday. This is only used when time_floor = TRUE.
time_floor | Should from be floored to the nearest unit specified in time_by? Default is FALSE.
roll_month | Control how impossible dates are handled when month or year arithmetic is involved. Options are "preday", "boundary", "postday", "full" and "NA". See ?timechange::time_add for more details.
roll_dst | See ?timechange::time_add for details.
sizes | Time sequence sizes.
This works like seq(), but using timechange for the period calculations and base::seq.POSIXt() for the duration calculations. In many ways it is improved over seq, as dates and/or datetimes can be supplied to the start and end points with no errors.

Examples like time_seq(now(), length.out = 10, time_by = "0.5 days", time_type = "duration") and time_seq(today(), length.out = 10, time_by = "0.5 days", time_type = "duration") produce more expected results compared to seq(now(), length.out = 10, by = "0.5 days") or seq(today(), length.out = 10, by = "0.5 days").
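To see the difference concretely, the two can be compared side by side (a minimal sketch; the exact output depends on the current date and time):

library(timeplyr)
library(lubridate)

# time_seq() keeps the half-day step exact by using durations
time_seq(today(), length.out = 10, time_by = "0.5 days",
         time_type = "duration")
# base::seq() handles the fractional step less predictably here
seq(today(), length.out = 10, by = "0.5 days")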
For a vectorized implementation with multiple start/end times, use time_seq_v()/time_seq_v2(). time_seq_sizes() is a convenience function to calculate time sequence lengths, given start/end times.

time_seq returns a time sequence. time_seq_sizes returns an integer vector of sequence sizes. time_seq_v and time_seq_v2 return time sequences.
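The vectorized variants are not demonstrated in the examples below, so here is a minimal sketch with illustrative start/end times:

library(timeplyr)
library(lubridate)

from <- today() + days(0:2) # three start points
to <- from + weeks(1)       # three matching end points

# One sequence per start/end pair
time_seq_v(from, to, time_by = "2 days")

# Equivalently, compute the sequence sizes first and reuse them
sizes <- time_seq_sizes(from, to, time_by = "2 days")
time_seq_v2(sizes, from, time_by = "2 days")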
library(timeplyr)
library(lubridate)

# Dates
today <- today()
now <- now()

time_seq(today, today + years(1), time_by = "day")
time_seq(today, length.out = 10, time_by = "day")
time_seq(today, length.out = 10, time_by = "hour")
time_seq(today, today + years(1), time_by = list("days" = 1)) # Alternative
time_seq(today, today + years(1), time_by = "week")
time_seq(today, today + years(1), time_by = "fortnight")
time_seq(today, today + years(1), time_by = "year")
time_seq(today, today + years(10), time_by = "year")
time_seq(today, today + years(100), time_by = "decade")

# Datetimes
time_seq(now, now + years(1), time_by = "12 hours")
time_seq(now, now + years(1), time_by = "day")
time_seq(now, now + years(1), time_by = "week")
time_seq(now, now + years(1), time_by = "fortnight")
time_seq(now, now + years(1), time_by = "year")
time_seq(now, now + years(10), time_by = "year")
time_seq(now, today + years(100), time_by = "decade")

# You can seamlessly mix dates and datetimes with no errors.
time_seq(now, today + days(3), time_by = "day")
time_seq(now, today + days(3), time_by = "hour")
time_seq(today, now + days(3), time_by = "day")
time_seq(today, now + days(3), time_by = "hour")

# Choose between durations or periods
start <- dmy(31012020)
# If time_type is left as is,
# periods are used for days, weeks, months and years.
time_seq(start, time_by = "month", length.out = 12, time_type = "period")
time_seq(start, time_by = "month", length.out = 12, time_type = "duration")
# Notice how strange the base R version is.
seq(start, by = "month", length.out = 12)

# Roll forward or backward impossible dates
leap <- dmy(29022020) # Leap day
end <- dmy(01032021)
# 3 different options
time_seq(leap, to = end, time_by = "year", roll_month = "NA")
time_seq(leap, to = end, time_by = "year", roll_month = "postday")
time_seq(leap, to = end, time_by = "year",
         roll_month = getOption("timeplyr.roll_month", "preday"))
A unique identifier is created every time a specified amount of time has passed, or in the case of regular sequences, when there is a gap in time.
time_seq_id(
  x, time_by = NULL, threshold = 1, g = NULL, na_skip = TRUE,
  rolling = TRUE, switch_on_boundary = FALSE,
  time_type = getOption("timeplyr.time_type", "auto")
)
Argument | Description
---|---
x | Date, datetime or numeric vector.
time_by | Time unit, e.g. "week" or "4 weeks".
threshold | Threshold such that when the time elapsed exceeds this, the sequence ID is incremented by 1. For example, with time_by = "weeks" and threshold = 2, the ID increments once more than 2 weeks have elapsed since the last time point. Default is 1.
g | Object used for grouping x. This can for example be a vector or data frame.
na_skip | Should NA values be skipped? Default is TRUE.
rolling | When this is FALSE, the elapsed time is counted cumulatively rather than rolling from the last time point. Default is TRUE.
switch_on_boundary | When an exact amount of time (specified in time_by) has passed, should the ID increment? Default is FALSE.
time_type | If "auto", periods are used for days, weeks, months and years, and durations otherwise.
time_seq_id() assumes x is regular and in ascending or descending order. To check this condition formally, use time_is_regular().

An integer vector of length(x) is returned.
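Since regularity is assumed rather than checked, a quick guard can be added before computing IDs (a minimal sketch; the weekly unit is illustrative):

library(timeplyr)
library(lubridate)

x <- time_seq(today(), length.out = 10, time_by = "week")
# Verify the regularity assumption before computing sequence IDs
stopifnot(time_is_regular(x, time_by = "week"))
time_seq_id(x, time_by = "week")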
library(dplyr)
library(timeplyr)
library(lubridate)

# Weekly sequence, with 2 gaps in between
x <- time_seq(today(), length.out = 10, time_by = "week")
x <- x[-c(3, 7)]

# A new ID when more than a week has passed since the last time point
time_seq_id(x, time_by = "week")
# A new ID when >= 2 weeks have passed since the last time point
time_seq_id(x, time_by = "weeks", threshold = 2, switch_on_boundary = TRUE)
# A new ID when at least 4 cumulative weeks have passed
time_seq_id(x, time_by = "4 weeks", switch_on_boundary = TRUE, rolling = FALSE)
# A new ID when more than 4 cumulative weeks have passed
time_seq_id(x, time_by = "4 weeks", switch_on_boundary = FALSE, rolling = FALSE)
Additional scales and transforms for use with year_months and year_quarters in ggplot2.
transform_year_month()
transform_year_quarter()

scale_x_year_month(...)
scale_x_year_quarter(...)
scale_y_year_month(...)
scale_y_year_quarter(...)
Argument | Description
---|---
... | Arguments passed to the underlying ggplot2 scale.
A ggplot2 scale or transform.
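These scales have no examples above; the following is a minimal sketch, where the data frame and its columns are illustrative:

library(timeplyr)
library(ggplot2)
library(dplyr)

# A small monthly series starting at an arbitrary date
df <- tibble(
  month = year_month(as.Date("2020-01-01")) + 0:23,
  value = cumsum(rnorm(24))
)

ggplot(df, aes(x = month, y = value)) +
  geom_line() +
  scale_x_year_month()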
Turn a ts into a tibble

While a method already exists in the tibble package, this method works differently in 2 ways:

1. The time variable associated with the time-series is also returned.
2. The returned tibble is always in long format, even when the time-series is multivariate.
ts_as_tibble(x, name = "time", value = "value", group = "group")

## Default S3 method:
ts_as_tibble(x, name = "time", value = "value", group = "group")

## S3 method for class 'mts'
ts_as_tibble(x, name = "time", value = "value", group = "group")

## S3 method for class 'xts'
ts_as_tibble(x, name = "time", value = "value", group = "group")

## S3 method for class 'zoo'
ts_as_tibble(x, name = "time", value = "value", group = "group")

## S3 method for class 'timeSeries'
ts_as_tibble(x, name = "time", value = "value", group = "group")
Argument | Description
---|---
x | An object of class ts, mts, xts, zoo or timeSeries.
name | Name of the output time column.
value | Name of the output value column.
group | Name of the output group column when there are multiple series.
A 2-column tibble containing the time index and values for each time index. In the case where there are multiple series, this becomes a 3-column tibble with an additional "group" column added.
library(timeplyr)
library(ggplot2)
library(dplyr)

# Using the examples from ?ts

# Univariate
uts <- ts(cumsum(1 + round(rnorm(100), 2)),
          start = c(1954, 7), frequency = 12)
uts_tbl <- ts_as_tibble(uts)

## Multivariate
mts <- ts(matrix(rnorm(300), 100, 3),
          start = c(1961, 1), frequency = 12)
mts_tbl <- ts_as_tibble(mts)

uts_tbl %>%
  time_ggplot(time, value)
mts_tbl %>%
  time_ggplot(time, value, group, facet = TRUE)

# zoo example
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14) - 1
x <- zoo::zoo(rnorm(5), x.Date)
ts_as_tibble(x)
x <- zoo::zoo(matrix(1:12, 4, 3), as.Date("2003-01-01") + 0:3)
ts_as_tibble(x)
This is a simple R function to convert time units to a common unit, with number and scale. See .time_units for a list of accepted time units.
unit_guess(x)
Argument | Description
---|---
x | This can be 1 of 4 options: a time unit string (e.g. "days"), a multi-unit string (e.g. "7 days"), a named list (e.g. list("months" = 12)), or a number.
A list of length 3, including the unit, number and scale.
library(timeplyr)

# Single units
unit_guess("days")
unit_guess("hours")

# Multi-units
unit_guess("7 days")
unit_guess("0.5 hours")

# Negative units
unit_guess("-7 days")
unit_guess("-.12 days")

# Exotic units
unit_guess("fortnights")
unit_guess("decades")
.extra_time_units

# list input is accepted
unit_guess(list("months" = 12))
# With a list, a vector of numbers is accepted
unit_guess(list("months" = 1:10))
unit_guess(list("days" = -10:10 %% 7))

# Numbers also accepted
unit_guess(100)
These are experimental methods for working with year-months and year-quarters inspired by 'zoo' and 'tsibble'.
year_month(x)
year_quarter(x)

YM(length = 0L)
year_month_decimal(x)
decimal_year_month(x)

YQ(length = 0L)
year_quarter_decimal(x)
decimal_year_quarter(x)
Argument | Description
---|---
x | A date, datetime or numeric vector.
length | Length of the returned year_month or year_quarter vector. Default is 0L.
The biggest difference is that the underlying data is simply the number of months/quarters since epoch. This makes integer arithmetic very simple and allows for fast sequence creation, as well as fast coercion to year_month and year_quarter from numeric vectors. The printing method is also fast.
library(timeplyr)
library(lubridate)

x <- year_month(today())

# Adding 1 adds 1 month
x + 1
# Adding 12 adds 1 year
x + 12
# Sequence of yearmonths
x + 0:12

# If you unclass, do the same arithmetic, and coerce back to year_month,
# the result is always the same
year_month(unclass(x) + 1)
year_month(unclass(x) + 12)

# Initialise a year_month or year_quarter to the specified length
YM(0)
YQ(0)
YM(3)
YQ(3)