| Title: | Respiratory Viral Infection Forecast Reporting |
|---|---|
| Description: | Tools for reporting and forecasting viral respiratory infections, using case surveillance data. Report generation tools for short-term forecasts, and validation metrics for an arbitrary number of customizable respiratory viruses. Estimation of the effective reproduction number is based on the 'EpiEstim' framework described in work by 'Cori' and colleagues. (2013) <doi:10.1093/aje/kwt133>. |
| Authors: | Mike Irvine [aut, cre, cph], Caesar Wong [aut], Nirupama Tamvada [aut], Rebeca Falcao [aut], Nelson Tang [aut] |
| Maintainer: | Mike Irvine <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.0.5 |
| Built: | 2026-06-03 22:03:23 UTC |
| Source: | https://github.com/bccdc-phsa/viroreportr |
This function prepares case count data for use with EpiEstim by performing a series of validation and cleaning steps:
clean_sample_data(data, start_date)clean_sample_data(data, start_date)
data |
A data frame containing at least the columns |
start_date |
A |
Ensures that the input data frame has the required columns:
"date" and "confirm".
Confirms that the specified start_date exists in the data and filters
the data to include only records on or after that date.
Removes leading days before the first non-zero confirmed case.
Verifies that the resulting dataset contains at least 14 valid days (as required for estimation).
This function is primarily intended as a preprocessing step for EpiEstim modeling. It combines validation checks for input structure and time coverage with minimal data cleaning logic to ensure robust downstream estimation.
A cleaned data frame filtered from start_date, starting at the
first date with non-zero confirmed tests, and containing at least 14 days
of data.
d by groups along a variable
summarise a data frame d by groups along a variable
create_quantiles(d, ..., variable = NULL)create_quantiles(d, ..., variable = NULL)
d |
tibble data frame |
... |
group_by variables |
variable |
string |
Data frame containing sample quantiles at probabilities 0.05, 0.25, 0.50, 0.75 and 0.95
Print out text output for ViroReportR report detailing current number of case visits, last value of Rt and corresponding intervals
current_forecast_text(time_period_result, n_days, ...)current_forecast_text(time_period_result, n_days, ...)
time_period_result |
output from |
n_days |
The number of days to run simulations for. Defaults to 14 |
... |
optional arguments to be passed on to |
current forecast metrics
A wrapper function for estimate_R from the EpiEstim library to estimate the reproduction number of epidemics to support short-term forecasts
fit_epiestim_model( data, window_size = 7L, type = NULL, mean_si = NULL, std_si = NULL, recon_opt = "match", method = "parametric_si", mean_prior = NULL, std_prior = NULL )fit_epiestim_model( data, window_size = 7L, type = NULL, mean_si = NULL, std_si = NULL, recon_opt = "match", method = "parametric_si", mean_prior = NULL, std_prior = NULL )
data |
data frame containing two columns: date and confirm (number of cases) |
window_size |
Integer Length of the sliding windows used for R estimates. |
type |
character Specifies type of epidemic. Must be one of "flu_a", "flu_b", "rsv", "sars_cov2" or "custom" |
mean_si |
Numeric User specification of mean of parametric serial interval |
std_si |
Numeric User specification of standard deviation of parametric serial interval |
recon_opt |
Not implemented. One of "naive" or "match" to pass on to |
method |
One of "non_parametric_si", "parametric_si", "uncertain_si", "si_from_data" or "si_from_sample" to pass on to |
mean_prior |
Numeric positive number giving the mean of the common prior distribution for all reproduction numbers |
std_prior |
Numeric positive number giving the standard deviation of the common prior distribution for all reproduction numbers |
fit_epiestim_model currently supports the following epidemics: Influenza, RSV and COVID-19. The default serial intervals for the estimation of R were retrieved from
Cowling et al., 2011, Vink et al., 2014 and Madewell et al., 2023 for Influenza A, Influenza B, RSV and COVID (BA.5 Omicron variant) respectively
Object of class estimate_R (see EpiEstim help page)
Extract current forecast metrics: forecast prediction, percentile interval and Rt value
forecast_metrics(time_period_result, iter = 10)forecast_metrics(time_period_result, iter = 10)
time_period_result |
output from |
iter |
number of MCMC iterations used to generate Rt posterior |
dataframe of current forecast metrics
This function prepares epidemic data, estimates the reproduction number
() using fit_epiestim_model, and produces short-term
forecasts of daily confirmed cases with project_epiestim_model.
It removes early periods with no cases, checks data validity, optionally smooths the epidemic curve, and then generates forward projections of cases for a specified number of days.
generate_forecast( data, start_date, window_size = 7, n_days = 7, type = NULL, smooth_data = FALSE, smoothing_cutoff = 10, seed = 123, ... )generate_forecast( data, start_date, window_size = 7, n_days = 7, type = NULL, smooth_data = FALSE, smoothing_cutoff = 10, seed = 123, ... )
data |
data frame Must contain two columns:
|
start_date |
Date Date after which the epidemic is considered to have started. Data before this date is removed. |
window_size |
Integer Length of the sliding window (in days) used for reproduction number estimation. Default is 7. |
n_days |
Integer Number of future days to forecast. Default is 7. |
type |
character
Type of epidemic. Must be one of |
smooth_data |
logical
Whether to smooth the input daily case counts before estimation. Default
is |
smoothing_cutoff |
Integer
Cutoff parameter for smoothing. Only used if |
seed |
Integer or NULL
Random seed used to ensure reproducibility of the forecast.
If provided, the same input data will produce identical results across runs.
If set to |
... |
Additional arguments passed to |
Data prior to the first non-zero confirm value is excluded.
Input is checked for validity (sufficient days, proper format).
If smoothing is enabled, case counts are adjusted before fitting.
Forecasts are generated from the fitted EpiEstim model and returned with quantiles (2.5%, 25%, 50%, 75%, 97.5%), minimum, and maximum.
A data frame of forecasted daily incidence with columns:
date: date of forecast
p50, p25, p75, p025, p975: forecast quantiles
min_sim, max_sim: forecast range
fit_epiestim_model for reproduction number estimation,
project_epiestim_model for forward simulations.
# Create sample test rsv data disease_type <- "rsv" test_data <- simulate_data() formatted_data <- get_aggregated_data( test_data, number_column = disease_type, date_column = "date", start_date = "2024-04-01", end_date = "2024-05-01" ) # Run a 7 day forecast with smoothing res_smooth <- generate_forecast( data = formatted_data, start_date = "2024-04-01", n_days = 7, type = "rsv", smooth_data = FALSE )# Create sample test rsv data disease_type <- "rsv" test_data <- simulate_data() formatted_data <- get_aggregated_data( test_data, number_column = disease_type, date_column = "date", start_date = "2024-04-01", end_date = "2024-05-01" ) # Run a 7 day forecast with smoothing res_smooth <- generate_forecast( data = formatted_data, start_date = "2024-04-01", n_days = 7, type = "rsv", smooth_data = FALSE )
Generates a full-season forecast report for viral respiratory diseases as an HTML document.
generate_forecast_report( input_data_dir = NULL, output_dir = NULL, n_days = 7, validate_window_size = 7, smooth = FALSE, disease_season = NULL, seed = 123 )generate_forecast_report( input_data_dir = NULL, output_dir = NULL, n_days = 7, validate_window_size = 7, smooth = FALSE, disease_season = NULL, seed = 123 )
input_data_dir |
Path to input CSV data. Must contain columns: |
output_dir |
Path to output directory for the rendered HTML report. |
n_days |
Number of days ahead to forecast. Default is 7. |
validate_window_size |
The number of days between each validation window. Default is 7. |
smooth |
Logical indicating whether smoothing should be applied in the forecast. Default is |
disease_season |
An optional named list specifying the seasonal date ranges for each disease. Each element should be either:
For example: disease_season = list( flu_a = c("2024-09-01", "2025-03-01"), rsv = c("2024-09-01", "2025-03-01"), sars_cov2 = NULL ) This will produce a report where influenza A and RSV seasons run from September 1, 2024 to March 1, 2025, while no season is defined for SARS-CoV-2. |
seed |
Integer or NULL
Random seed used to ensure reproducibility of the forecast.
If provided, the same input data will produce identical results across runs.
If set to |
Invisibly returns the path to the rendered HTML report.
data <- simulate_data(start_date = "2024-01-07", #starting Sunday ) diseases <- c("flu_a", "rsv", "sars_cov2") data$date <- lubridate::ymd(data$date) vri_data_list <- purrr::set_names( purrr::map2( rep(list(data), length(diseases)), diseases, ~ get_aggregated_data(.x, "date", .y) ), diseases ) # Save the simulated data df <- purrr::imap_dfr( vri_data_list, \(df, disease) dplyr::mutate(df, disease_type = disease) ) tmp_dir <- tempdir() # temporary directory for example for saving data data_path <- file.path(tmp_dir, "simulated_data.csv") write.csv(df, data_path, row.names = FALSE) output_path <- tempdir() # output directory for report (temporary as example) generate_forecast_report(input_data_dir = data_path, output_dir = output_path, n_days = 7, validate_window_size = 7, smooth = FALSE)data <- simulate_data(start_date = "2024-01-07", #starting Sunday ) diseases <- c("flu_a", "rsv", "sars_cov2") data$date <- lubridate::ymd(data$date) vri_data_list <- purrr::set_names( purrr::map2( rep(list(data), length(diseases)), diseases, ~ get_aggregated_data(.x, "date", .y) ), diseases ) # Save the simulated data df <- purrr::imap_dfr( vri_data_list, \(df, disease) dplyr::mutate(df, disease_type = disease) ) tmp_dir <- tempdir() # temporary directory for example for saving data data_path <- file.path(tmp_dir, "simulated_data.csv") write.csv(df, data_path, row.names = FALSE) output_path <- tempdir() # output directory for report (temporary as example) generate_forecast_report(input_data_dir = data_path, output_dir = output_path, n_days = 7, validate_window_size = 7, smooth = FALSE)
This function performs rolling validation of short-term forecasts generated by EpiEstim or similar models. It divides the input time series into overlapping validation windows and repeatedly runs forecasts to assess model performance across different time segments.
generate_validation( data, start_date, validate_window_size = 7, window_size = 7, n_days = 7, type = NULL, smooth_data = FALSE, smoothing_cutoff = 10, seed = 123, ... )generate_validation( data, start_date, validate_window_size = 7, window_size = 7, n_days = 7, type = NULL, smooth_data = FALSE, smoothing_cutoff = 10, seed = 123, ... )
data |
A data frame containing at least the columns |
start_date |
A |
validate_window_size |
Integer. The number of days between each
validation window (default: |
window_size |
Integer. The sliding window size (in days) used by the
forecasting model (default: |
n_days |
Integer. The number of future days to forecast in each
validation iteration (default: |
type |
character
Type of epidemic. Must be one of |
smooth_data |
Logical. Whether to smooth the input case counts prior
to forecasting (default: |
smoothing_cutoff |
Numeric. Threshold used for smoothing when
|
seed |
Integer or NULL
Random seed used to ensure reproducibility of the forecast.
If provided, the same input data will produce identical results across runs.
If set to |
... |
Additional arguments passed to |
The validation procedure ensures that forecasts are evaluated under realistic temporal conditions. Starting from the earliest date, the function repeatedly:
Takes a growing subset of data up to the current validation endpoint.
Runs the forecast using generate_forecast().
Moves the validation window forward by validate_window_size days.
This results in a set of forecasts that can be compared to observed data to evaluate predictive performance across time.
A list of forecast results, each element corresponding to one
validation window. Each element contains the output returned by
generate_forecast() for that particular window.
clean_sample_data(), generate_forecast()
data <- simulate_data() formatted_data <- get_aggregated_data(data,"date", "flu_a", "2024-10-16", "2024-12-31") start_date <- as.Date("2024-10-16") validation_results <- generate_validation(formatted_data, start_date, type="flu_a")data <- simulate_data() formatted_data <- get_aggregated_data(data,"date", "flu_a", "2024-10-16", "2024-12-31") start_date <- as.Date("2024-10-16") validation_results <- generate_validation(formatted_data, start_date, type="flu_a")
This function evaluates forecast accuracy across multiple validation runs by computing two key performance metrics:
generate_validation_metric(data, validation_res)generate_validation_metric(data, validation_res)
data |
A data frame used in
|
validation_res |
A list of forecast validation results, typically the
output from
|
Symmetric Mean Absolute Percentage Error (SMAPE): Measures relative forecast accuracy while remaining robust to zero values in the actual data.
Mean Absolute Scaled Error (MASE): Scales forecast errors relative to the in-sample one-step naïve forecast, allowing comparison across series with different scales.
For each forecast result, the function also reports the corresponding training and forecast periods. Computation stops once the forecast period reaches the maximum date in the model data.
SMAPE is defined as:
where \(A\) are actual values and \(F\) are forecasts. It avoids division by zero and is suitable for count data with zeros.
MASE compares the mean absolute forecast error against the mean absolute difference of successive actual:
The function automatically excludes forecasts extending beyond the latest date in the observed model data.
A tibble (data frame) with one row per forecast result and the
following columns:
train_period: Date range of the training period used for the forecast.
forecast_period: Date range of the forecasted period.
smape: Symmetric Mean Absolute Percentage Error between forecasted and
actual values, rounded to two decimals.
mase: Mean Absolute Scaled Error, rounded to two decimals.
generate_validation(), generate_forecast()
data <- simulate_data() formatted_data <- get_aggregated_data(data,"date", "flu_a", "2024-10-16", "2024-12-31") start_date <- ("2024-10-16") validation_results <- generate_validation(formatted_data, start_date, type ="flu_a") generate_validation_metric(formatted_data, validation_results)data <- simulate_data() formatted_data <- get_aggregated_data(data,"date", "flu_a", "2024-10-16", "2024-12-31") start_date <- ("2024-10-16") validation_results <- generate_validation(formatted_data, start_date, type ="flu_a") generate_validation_metric(formatted_data, validation_results)
get_aggregated_data() performs data transformation in the following steps:
Group the weekly or daily data by date.
Aggregate the number of confirmed cases by either day or week.
Select only the date and confirmed cases column.
Filter the data by given start and end date
The input dataframe generic_data must have the following columns:
<date name>: date column (e.g. as.Date('2022-01-01')).
<cases count name>: Confirmed Cases Count (e.g. 1, 2, ...).
Note that these columns can be defined in a generic name, and inputted as
the other two function parameters for data transformation (date_column,
number_column)
Assume the date column is the start of the epiweek.
get_aggregated_data( generic_data, date_column, number_column, start_date = NULL, end_date = NULL, unit = "day" )get_aggregated_data( generic_data, date_column, number_column, start_date = NULL, end_date = NULL, unit = "day" )
generic_data |
the weekly generic data from |
date_column |
date column name str |
number_column |
cases count column name str |
start_date |
start date string (e.g. '2022-01-01')(optional, default is NULL) |
end_date |
end date string (e.g. '2022-12-31')(optional, default is NULL) |
unit |
aggregation unit "day" or "week" |
aggregated weekly data of the generic confirmed cases data (filtered by date if any)
Either day or week date
number of confirmed cases
sim_data <- simulate_data() aggregated_data <- get_aggregated_data( sim_data, "date", "flu_a", "2024-10-16", "2024-12-31" )sim_data <- simulate_data() aggregated_data <- get_aggregated_data( sim_data, "date", "flu_a", "2024-10-16", "2024-12-31" )
Plot Mean Rt with time index (dates)
plot_rt(forecast_results)plot_rt(forecast_results)
forecast_results |
is the output of |
Mean Rt with time index plot
# Create sample test rsv data disease_type <- "rsv" test_data <- simulate_data() formatted_data <- get_aggregated_data( test_data, number_column = disease_type, date_column = "date", start_date = "2024-04-01", end_date = "2024-05-01" ) # Run a 7 day forecast with smoothing forecast_results <- generate_forecast( data = formatted_data, start_date = "2024-04-01", n_days = 7, type = "rsv", smooth_data = FALSE ) plot_rt(forecast_results)# Create sample test rsv data disease_type <- "rsv" test_data <- simulate_data() formatted_data <- get_aggregated_data( test_data, number_column = disease_type, date_column = "date", start_date = "2024-04-01", end_date = "2024-05-01" ) # Run a 7 day forecast with smoothing forecast_results <- generate_forecast( data = formatted_data, start_date = "2024-04-01", n_days = 7, type = "rsv", smooth_data = FALSE ) plot_rt(forecast_results)
Plot a ribbon plot with each time horizon predictions against true values for validation
plot_validation(data, validation_res, pred_plot = "ribbon")plot_validation(data, validation_res, pred_plot = "ribbon")
data |
A data frame used in
|
validation_res |
A list of forecast validation results, typically
produced by
|
pred_plot |
either |
error_bar validation plot or ribbon validation plot for a specific prediction horizon
data <- simulate_data() formatted_data <- get_aggregated_data(data,"date", "flu_a", "2024-10-16", "2024-12-31") start_date <- ("2024-10-16") validation_results <- generate_validation(formatted_data, start_date, type ="flu_a") plot_validation(formatted_data, validation_results)data <- simulate_data() formatted_data <- get_aggregated_data(data,"date", "flu_a", "2024-10-16", "2024-12-31") start_date <- ("2024-10-16") validation_results <- generate_validation(formatted_data, start_date, type ="flu_a") plot_validation(formatted_data, validation_results)
Function to produce short-term daily projections from objects of class estimate_R
project_epiestim_model(data, model_fit, n_days = 7, n_sim = 1000)project_epiestim_model(data, model_fit, n_days = 7, n_sim = 1000)
data |
data frame containing two columns: date and confirm (number of cases per day) |
model_fit |
Object of class |
n_days |
The number of days to run simulations for. Defaults to 14 |
n_sim |
The number of epicurves to simulate. Defaults to 1000 |
Data-frame of daily forecast samples from all simulations
date
projected number of daily confirmed cases
simulation run number
Generates simulated daily incidence data for specified respiratory viruses over a defined number of days. Each virus is modeled using a Gaussian-like curve, parameterized by peak day, amplitude, and scale.
simulate_data( days = 365, peaks = c(flu_a = 90, rsv = 110, sars_cov2 = 160), amplitudes = c(flu_a = 50, rsv = 40, sars_cov2 = 20), scales = c(flu_a = -0.004, rsv = -0.005, sars_cov2 = -0.001), time_offset = 0, noise_sd = 5, start_date = "2024-01-07" )simulate_data( days = 365, peaks = c(flu_a = 90, rsv = 110, sars_cov2 = 160), amplitudes = c(flu_a = 50, rsv = 40, sars_cov2 = 20), scales = c(flu_a = -0.004, rsv = -0.005, sars_cov2 = -0.001), time_offset = 0, noise_sd = 5, start_date = "2024-01-07" )
days |
Integer. Number of days to simulate (default is 365). |
peaks |
Named numeric vector. Peak day for each virus
(e.g., |
amplitudes |
Named numeric vector. Amplitude for each virus's peak
(e.g., |
scales |
Named numeric vector. Scale controlling spread of the peak
for each virus (e.g., |
time_offset |
Integer. Number of days to offset start of the simulation. useful if want to test data with larger values in the middle of a respiratory season. |
noise_sd |
numeric or named numeric.
Gaussian noise applied to each virus signal. can either be a single value
or named for each virus e.g., |
start_date |
string |
A data frame with daily simulated incidence counts for each virus,
including a date column.
simulate_data() simulate_data(days = 100, peaks = c(flu_a = 30), amplitudes = c(flu_a = 60), scales = c(flu_a = -0.01), noise_sd = c(flu_a = 5))simulate_data() simulate_data(days = 100, peaks = c(flu_a = 30), amplitudes = c(flu_a = 60), scales = c(flu_a = -0.01), noise_sd = c(flu_a = 5))