Package 'DataFakeR' reference manual

Title:	Generate Fake Data for Relational Databases
Description:	Based on provided database description and/or database connection generate data sample preserving source structure.
Authors:	Krystian Igras [aut, cre], Kamil Wais [ctb], Adam Foryś [ctb], Adam Leśniewski [ctb], Paweł Kawski [ctb]
Maintainer:	Krystian Igras <[email protected]>
License:	MIT + file LICENSE
Version:	0.1.3
Built:	2025-03-21 20:19:21 UTC
Source:	https://github.com/cranhaven/cranhaven.r-universe.dev

Setup default column type parameters

Description

All the parameters (excluding regexp) are attached to column definition when the ones are not specified in configuration YAML file. All the functions are used to specify default configuration (see: default_faker_opts).

Usage

opt_default_character(
  regexp = "text|char|factor",
  nchar = 10,
  na_ratio = 0.05,
  not_null = FALSE,
  unique = FALSE,
  default = "",
  levels_ratio = 1,
  ...
)

opt_default_numeric(
  regexp = "^decimal|^numeric|real|double precision",
  na_ratio = 0.05,
  not_null = FALSE,
  unique = FALSE,
  default = 0,
  precision = 7,
  scale = 2,
  levels_ratio = 1,
  ...
)

opt_default_integer(
  regexp = "smallint|integer|bigint|smallserial|serial|bigserial",
  na_ratio = 0.05,
  not_null = FALSE,
  unique = FALSE,
  default = "",
  levels_ratio = 1,
  ...
)

opt_default_logical(
  regexp = "boolean|logical",
  na_ratio = 0.05,
  not_null = FALSE,
  unique = FALSE,
  default = FALSE,
  levels_ratio = 1,
  ...
)

opt_default_date(
  regexp = "date|Date",
  na_ratio = 0.05,
  not_null = FALSE,
  unique = FALSE,
  default = Sys.Date(),
  format = "%Y-%m-%d",
  min_date = as.Date("1970-01-01"),
  max_date = Sys.Date(),
  levels_ratio = 1,
  ...
)
opt_default_character(
  regexp = "text|char|factor",
  nchar = 10,
  na_ratio = 0.05,
  not_null = FALSE,
  unique = FALSE,
  default = "",
  levels_ratio = 1,
  ...
)

opt_default_numeric(
  regexp = "^decimal|^numeric|real|double precision",
  na_ratio = 0.05,
  not_null = FALSE,
  unique = FALSE,
  default = 0,
  precision = 7,
  scale = 2,
  levels_ratio = 1,
  ...
)

opt_default_integer(
  regexp = "smallint|integer|bigint|smallserial|serial|bigserial",
  na_ratio = 0.05,
  not_null = FALSE,
  unique = FALSE,
  default = "",
  levels_ratio = 1,
  ...
)

opt_default_logical(
  regexp = "boolean|logical",
  na_ratio = 0.05,
  not_null = FALSE,
  unique = FALSE,
  default = FALSE,
  levels_ratio = 1,
  ...
)

opt_default_date(
  regexp = "date|Date",
  na_ratio = 0.05,
  not_null = FALSE,
  unique = FALSE,
  default = Sys.Date(),
  format = "%Y-%m-%d",
  min_date = as.Date("1970-01-01"),
  max_date = Sys.Date(),
  levels_ratio = 1,
  ...
)

Arguments

`regexp`	Regular expression that allows mapping YAML configuration column type to desired R class.
`nchar`	Maximum number of characters when simulating character values. When source column is of type `char(n)` the parameter is ignored.
`na_ratio`	Ratio of NA values returned in simulated sample.
`not_null`	Should the column allow to simulate NA values?
`unique`	Should column values be unique?
`default`	Default column value. Ignored during simulation.
`levels_ratio`	Ratio of unique values (in terms of sample length) simulated in the sample.
`...`	Other default parameters attached to the column definition.
`precision`	Precision of numeric column value when simulating numeric values. When source column is of type e.g. `numeric(precision)` the parameter is ignored.
`scale`	Precision of numeric column value when simulating numeric values. When source column is of type e.g. `numeric(precision, scale)` the parameter is ignored.
`format`	Format of date used when simulating Date columns.
`min_date`, `max_date`	Minimum and maximum date used when simulating Date columns.

Default options for pulling metadata and data simulation

Description

Generated with the set of configuration functions: default_simulation_params, opt_default_table, special_simulation, restricted_simulation, sourcing_metadata.

Usage

default_faker_opts

set_faker_opts(
  opt_pull_character,
  opt_pull_numeric,
  opt_pull_integer,
  opt_pull_logical,
  opt_pull_date,
  opt_pull_table,
  opt_default_character,
  opt_simul_spec_character,
  opt_simul_restricted_character,
  opt_simul_default_fun_character,
  opt_default_numeric,
  opt_simul_spec_numeric,
  opt_simul_restricted_numeric,
  opt_simul_default_fun_numeric,
  opt_default_integer,
  opt_simul_spec_integer,
  opt_simul_restricted_integer,
  opt_simul_default_fun_integer,
  opt_default_logical,
  opt_simul_spec_logical,
  opt_simul_restricted_logical,
  opt_simul_default_fun_logical,
  opt_default_date,
  opt_simul_spec_date,
  opt_simul_restricted_date,
  opt_simul_default_fun_date,
  opt_default_table,
  global = TRUE
)

get_faker_opts()
default_faker_opts

set_faker_opts(
  opt_pull_character,
  opt_pull_numeric,
  opt_pull_integer,
  opt_pull_logical,
  opt_pull_date,
  opt_pull_table,
  opt_default_character,
  opt_simul_spec_character,
  opt_simul_restricted_character,
  opt_simul_default_fun_character,
  opt_default_numeric,
  opt_simul_spec_numeric,
  opt_simul_restricted_numeric,
  opt_simul_default_fun_numeric,
  opt_default_integer,
  opt_simul_spec_integer,
  opt_simul_restricted_integer,
  opt_simul_default_fun_integer,
  opt_default_logical,
  opt_simul_spec_logical,
  opt_simul_restricted_logical,
  opt_simul_default_fun_logical,
  opt_default_date,
  opt_simul_spec_date,
  opt_simul_restricted_date,
  opt_simul_default_fun_date,
  opt_default_table,
  global = TRUE
)

get_faker_opts()

Arguments

opt_pull_character, opt_pull_numeric, opt_pull_integer, opt_pull_logical, opt_pull_date, opt_pull_table, opt_default_character, opt_simul_spec_character, opt_simul_restricted_character, opt_simul_default_fun_character, opt_default_numeric, opt_simul_spec_numeric, opt_simul_restricted_numeric, opt_simul_default_fun_numeric, opt_default_integer, opt_simul_spec_integer, opt_simul_restricted_integer, opt_simul_default_fun_integer, opt_default_logical, opt_simul_spec_logical, opt_simul_restricted_logical, opt_simul_default_fun_logical, opt_default_date, opt_simul_spec_date, opt_simul_restricted_date, opt_simul_default_fun_date, opt_default_table

Parameters defined in default configuration that can be modified by using set_faker_opts function. Please make sure each parameter is specified by method designed to it.

global

If TRUE, default the configuration will be set up globally (no need to pass it as a faker_opts parameter for schema_source and schema_methods).

Format

An object of class list of length 27.

Details

set_faker_opts allows to overwrite selected options. get_faker_opts lists the current options configuration.

Methods for extracting number of target rows in simulation

Description

Each method returns function of list of tables. The value of such function is named list being mapping between tables (names of list) and target number of rows (values of list). Such methods can be passed as nrows parameter of opt_default_table.

Usage

nrows_simul_constant(n, force = FALSE)

nrows_simul_ratio(ratio, total, force = FALSE)
nrows_simul_constant(n, force = FALSE)

nrows_simul_ratio(ratio, total, force = FALSE)

Arguments

`n`	Default number of rows for each table when not defined in configuration file.
`force`	Should specified parameters overwrite related configuration parameters?
`ratio`, `total`	The parameters multiplications results with defining target number of rows for simulated table. See details section.

Details

Currently supported methods are:

nrows_simul_constant Returns n rows for each table when not defined in YAML parameter nrows
nrows_simul_ratio Returns nrows * ratio when nrows defined as YAML parameter and is integer. Returns nrows when nrows defined as YAML parameter and id fraction, Returns n * ratio otherwise.

Configure data simulation options

Description

The parameters affect high level (not column type related) simulation settings such as target number of rows for each table. Currently only number of simulated rows is supported.

Usage

opt_default_table(nrows = nrows_simul_constant(10))
opt_default_table(nrows = nrows_simul_constant(10))

Arguments

nrows

Integer or function. When nrows is precised as an integer value, all the tables will have the same number of rows. In case of function, the should take tables configuration (list of tables section from configuration YSML file) and return named list of table with rows values. See nrows_simul_constant and nrows_simul_ratio for more details.

Simulate data restricted by extra column parameters

Description

The functions allow to define a set of methods for simulating data using additional column-based parameters such as range or values.

Usage

opt_simul_restricted_character(
  f_key = simul_restricted_character_fkey,
  ...,
  in_set = simul_restricted_character_in_set
)

opt_simul_restricted_numeric(
  f_key = simul_restricted_numeric_fkey,
  ...,
  in_set = simul_restricted_numeric_in_set,
  range = simul_restricted_numeric_range
)

opt_simul_restricted_integer(
  f_key = simul_restricted_integer_fkey,
  ...,
  in_set = simul_restricted_integer_in_set,
  range = simul_restricted_integer_range
)

opt_simul_restricted_logical(f_key = simul_restricted_integer_fkey, ...)

opt_simul_restricted_date(
  f_key = simul_restricted_integer_fkey,
  ...,
  range = simul_restricted_date_range
)
opt_simul_restricted_character(
  f_key = simul_restricted_character_fkey,
  ...,
  in_set = simul_restricted_character_in_set
)

opt_simul_restricted_numeric(
  f_key = simul_restricted_numeric_fkey,
  ...,
  in_set = simul_restricted_numeric_in_set,
  range = simul_restricted_numeric_range
)

opt_simul_restricted_integer(
  f_key = simul_restricted_integer_fkey,
  ...,
  in_set = simul_restricted_integer_in_set,
  range = simul_restricted_integer_range
)

opt_simul_restricted_logical(f_key = simul_restricted_integer_fkey, ...)

opt_simul_restricted_date(
  f_key = simul_restricted_integer_fkey,
  ...,
  range = simul_restricted_date_range
)

Arguments

`f_key`	Method for simulating foreign key columns. The `values` parameter of the function, receives all the unique values from parent primary key column.
`...`	Other methods that can be defined to handle extra parameters.
`in_set`	Method for simulating columns from defined set of values. The `values` parameter of the function, take all the values defined in YAML column definition as `values` parameter.
`range`	Method for simulating columns fitting inside defined range. It takes special parameter `range` 2-length vector minimum and maximum value for simulated data.

Details

Except for the standard column parameters, that are now:

type
unique
not_null
default
nchar
min_date
max_date
precision
scale

it is also allowed to add custom ones (either directly in YAML configuration file, or in opt_default_<column_type> functions).

In order to respect simulation using such parameters, we may want to define our custom simulation functions.

Such functions should be defined as a parameters of opt_simul_restricted_<column_type> functions, and each of them should take special parameter as its own one.

When the parameter condition is not met (for example the parameter is missing) such function should return NULL value. This allows the simulation workflow to move to the next defined method. The order of methods execution is followed by the order of defined parameters in the below methods.

That means, the highest priority always have f_key - a special method that is used for foreign key columns, and simulates only from values received from parent primary key.

The second priority method for character type columns is in_set, that seeks for values column parameter, and when such exists it simulates the data from defined set of values. See simul_restricted_character_in_set definition to check details.

Modify sample with desired condition

Description

The set of function that allows to perform most common operations ion data sample.

Usage

unique_sample(sim_expr, ..., unique = TRUE, n_name = "n", n_iter = 10)

na_rand(sample_vec, na_ratio, not_null = FALSE)

levels_rand(sample_vec, levels_ratio, unique)
unique_sample(sim_expr, ..., unique = TRUE, n_name = "n", n_iter = 10)

na_rand(sample_vec, na_ratio, not_null = FALSE)

levels_rand(sample_vec, levels_ratio, unique)

Arguments

`sim_expr`	Expression to be evaluated in order to get column sample.
`...`	Parameters and their values that are used in `sim_expr`.
`unique`	If TRUE the function will try to simulate unique values.
`n_name`	Name of the parameter providing sample length (for example 'n' for `rnorm` and 'size' for `sample`).
`n_iter`	Number of iteration to make to assure the returned values are unique.
`sample_vec`	Vector to which NA values should be injected.
`na_ratio`	Ratio (in terms of column length) of NA values to attach to the sample.
`not_null`	Information whether NA's are allowed.
`levels_ratio`	Ratio of unique levels in terms of whole sample length.

Details

unique_sample - takes simulation expression and assures the expression will be executed as many times as needed to return unique result sample. na_rand - attaches NA values to the sample according to provided NA's ratio. levels_rand - takes provided number of sample levels, and assures the returned sample have as many levels as requested.

Examples


unique_sample(rnorm(n, mean = my_mean), n = 10, my_mean = 2)
unique_sample(sample(values, size, replace = TRUE), size = 10, values = 1:10, n_name = "size")

## Not run: 
  ## In 10 iterations it was not possible to simulate 6 unique values from the vector 1:5
  unique_sample(sample(values, size, replace = TRUE), size = 6, values = 1:5, n_name = "size")

## End(Not run)

na_rand(1:10, na_ratio = 0.5)

unique_sample(rnorm(n, mean = my_mean), n = 10, my_mean = 2)
unique_sample(sample(values, size, replace = TRUE), size = 10, values = 1:10, n_name = "size")

## Not run: 
  ## In 10 iterations it was not possible to simulate 6 unique values from the vector 1:5
  unique_sample(sample(values, size, replace = TRUE), size = 6, values = 1:5, n_name = "size")

## End(Not run)

na_rand(1:10, na_ratio = 0.5)

Schema object methods

Description

The set of methods that can be used on schema object returned by schema_source function.

Usage

schema_update_source(
  schema,
  file,
  faker_opts = getOption("dfkr_options", default_faker_opts)
)

schema_get_table(schema, table_name)

schema_plot_deps(schema, table_name)

schema_simulate(schema)
schema_update_source(
  schema,
  file,
  faker_opts = getOption("dfkr_options", default_faker_opts)
)

schema_get_table(schema, table_name)

schema_plot_deps(schema, table_name)

schema_simulate(schema)

Arguments

`schema`	Schema object keeping table dependency graph.
`file`	Path to schema configuration yaml file.
`faker_opts`	Structure sourcing and columns simulation config.
`table_name`	Name of the table.

Details

The methods are:

schema_update_source Update schema dependency graph based on provided file.
schema_simulate Run data simulation process.
schema_get_table Get simulated table value.
schema_plot_deps Plot inter or inner table dependecies.

Source schema file into dependency graph object

Description

The functions parses table schema (from database) and saves its structure yaml format. The defined structure is then used to prepare schema dependency graph, that is:

dependencies between tablesBased on foreign key definitions
inner table column dependenciesBased on defined dependencies by various methods. See vignette('todo').

Usage

schema_source(
  source,
  schema = "public",
  file = if (is.character(source)) source else file.path(getwd(), "schema.yml"),
  faker_opts = getOption("dfkr_options", default_faker_opts)
)
schema_source(
  source,
  schema = "public",
  file = if (is.character(source)) source else file.path(getwd(), "schema.yml"),
  faker_opts = getOption("dfkr_options", default_faker_opts)
)

Arguments

`source`	Connection to Redshift or Postgres database or path to YAML configuration file from which schema metadata should be sourced. When missing `file` defined file will be sourced if existing.
`schema`	Schema name from which the structure should be sourced.
`file`	Path to yaml file describing database schema, or target file when schema should be saved (when `db_conn` not mising). See `vignette('todo')`.
`faker_opts`	Structure sourcing and columns simulation config.

Details

Detected dependencies are then saved in R6Class object that is returned and possible to pass for further methods. See schema_methods.

Keeping the schema as a graph allows to perform simulation process in proper order, preserving table dependencies and constraints.

Character type simulation methods

Description

Character type simulation methods

Usage

simul_spec_character_name(
  n,
  not_null,
  unique,
  default,
  spec_params,
  na_ratio,
  levels_ratio,
  ...
)

simul_default_character(
  n,
  not_null,
  unique,
  default,
  nchar,
  type,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_character_in_set(
  n,
  not_null,
  unique,
  default,
  nchar,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_character_fkey(
  n,
  not_null,
  unique,
  default,
  nchar,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)
simul_spec_character_name(
  n,
  not_null,
  unique,
  default,
  spec_params,
  na_ratio,
  levels_ratio,
  ...
)

simul_default_character(
  n,
  not_null,
  unique,
  default,
  nchar,
  type,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_character_in_set(
  n,
  not_null,
  unique,
  default,
  nchar,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_character_fkey(
  n,
  not_null,
  unique,
  default,
  nchar,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

Arguments

`n`	Number of values to simulate.
`not_null`	Should NA values be forbidden?
`unique`	Should duplicated values be allowed?
`default`	Default column value.
`spec_params`	Set of parameters passed to special method.
`na_ratio`	Ratio of NA values (in terms of sample length) the sample should have.
`levels_ratio`	Fraction of levels (in terms of sample length) the sample should have.
`...`	Other parameters passed to column configuration in YAML file.
`nchar`	Maximum number of characters for each value.
`type`	Column raw type (sourced from configuration file).
`values`	Possible values from which to perform simulation.

Date type simulation methods

Description

Date type simulation methods

Usage

simul_spec_date_distr(
  n,
  not_null,
  unique,
  default,
  spec_params,
  na_ratio,
  levels_ratio,
  ...
)

simul_default_date(
  n,
  not_null,
  unique,
  default,
  type,
  min_date,
  max_date,
  format,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_date_range(
  n,
  not_null,
  unique,
  default,
  type,
  range,
  format,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_date_fkey(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)
simul_spec_date_distr(
  n,
  not_null,
  unique,
  default,
  spec_params,
  na_ratio,
  levels_ratio,
  ...
)

simul_default_date(
  n,
  not_null,
  unique,
  default,
  type,
  min_date,
  max_date,
  format,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_date_range(
  n,
  not_null,
  unique,
  default,
  type,
  range,
  format,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_date_fkey(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

Arguments

`n`	Number of values to simulate.
`not_null`	Should NA values be forbidden?
`unique`	Should duplicated values be allowed?
`default`	Default column value.
`spec_params`	Set of parameters passed to special method.
`na_ratio`	Ratio of NA values (in terms of sample length) the sample should have.
`levels_ratio`	Fraction of levels (in terms of sample length) the sample should have.
`...`	Other parameters passed to column configuration in YAML file.
`type`	Column raw type (sourced from configuration file).
`format`	Date format used to store dates.
`range`, `min_date`, `max_date`	Date range or minimum and maximum date from which to simulate data.
`values`	Possible values from which to perform simulation.

Integer type simulation methods

Description

Integer type simulation methods

Usage

simul_spec_integer_distr(
  n,
  not_null,
  unique,
  default,
  spec_params,
  na_ratio,
  levels_ratio,
  ...
)

simul_default_integer(
  n,
  not_null,
  unique,
  default,
  type,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_integer_range(
  n,
  not_null,
  unique,
  default,
  type,
  range,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_integer_in_set(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_integer_fkey(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)
simul_spec_integer_distr(
  n,
  not_null,
  unique,
  default,
  spec_params,
  na_ratio,
  levels_ratio,
  ...
)

simul_default_integer(
  n,
  not_null,
  unique,
  default,
  type,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_integer_range(
  n,
  not_null,
  unique,
  default,
  type,
  range,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_integer_in_set(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_integer_fkey(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

Arguments

`n`	Number of values to simulate.
`not_null`	Should NA values be forbidden?
`unique`	Should duplicated values be allowed?
`default`	Default column value.
`spec_params`	Set of parameters passed to special method.
`na_ratio`	Ratio of NA values (in terms of sample length) the sample should have.
`levels_ratio`	Fraction of levels (in terms of sample length) the sample should have.
`...`	Other parameters passed to column configuration in YAML file.
`type`	Column raw type (sourced from configuration file).
`range`	Possible range of values from which to perform simulation.
`values`	Possible values from which to perform simulation.

Logical type simulation methods

Description

Logical type simulation methods

Usage

simul_spec_logical_distr(
  n,
  not_null,
  unique,
  default,
  spec_params,
  na_ratio,
  levels_ratio,
  ...
)

simul_default_logical(
  n,
  not_null,
  unique,
  default,
  type,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_logical_fkey(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)
simul_spec_logical_distr(
  n,
  not_null,
  unique,
  default,
  spec_params,
  na_ratio,
  levels_ratio,
  ...
)

simul_default_logical(
  n,
  not_null,
  unique,
  default,
  type,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_logical_fkey(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

Arguments

`n`	Number of values to simulate.
`not_null`	Should NA values be forbidden?
`unique`	Should duplicated values be allowed?
`default`	Default column value.
`spec_params`	Set of parameters passed to special method.
`na_ratio`	Ratio of NA values (in terms of sample length) the sample should have.
`levels_ratio`	Fraction of levels (in terms of sample length) the sample should have.
`...`	Other parameters passed to column configuration in YAML file.
`type`	Column raw type (sourced from configuration file).
`values`	Possible values from which to perform simulation.

Numeric type simulation methods

Description

Numeric type simulation methods

Usage

simul_spec_numeric_distr(
  n,
  not_null,
  unique,
  default,
  spec_params,
  na_ratio,
  levels_ratio,
  ...
)

simul_default_numeric(
  n,
  not_null,
  unique,
  default,
  type,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_numeric_range(
  n,
  not_null,
  unique,
  default,
  type,
  range,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_numeric_in_set(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_numeric_fkey(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)
simul_spec_numeric_distr(
  n,
  not_null,
  unique,
  default,
  spec_params,
  na_ratio,
  levels_ratio,
  ...
)

simul_default_numeric(
  n,
  not_null,
  unique,
  default,
  type,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_numeric_range(
  n,
  not_null,
  unique,
  default,
  type,
  range,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_numeric_in_set(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

simul_restricted_numeric_fkey(
  n,
  not_null,
  unique,
  default,
  type,
  values,
  na_ratio,
  levels_ratio,
  ...
)

Arguments

`n`	Number of values to simulate.
`not_null`	Should NA values be forbidden?
`unique`	Should duplicated values be allowed?
`default`	Default column value.
`spec_params`	Set of parameters passed to special method.
`na_ratio`	Ratio of NA values (in terms of sample length) the sample should have.
`levels_ratio`	Fraction of levels (in terms of sample length) the sample should have.
`...`	Other parameters passed to column configuration in YAML file.
`type`	Column raw type (sourced from configuration file).
`range`	Possible range of values from which to perform simulation.
`values`	Possible values from which to perform simulation.

Specify YAML configuration options while pulling the schema from DB

Description

The set of function allows to configure which data information should be saved to configuration YAML file when such configuration is sourced directly from database schema.

Usage

opt_pull_character(
  values = TRUE,
  max_uniq_to_pull = 10,
  nchar = TRUE,
  na_ratio = TRUE,
  levels_ratio = TRUE,
  ...
)

opt_pull_numeric(
  values = TRUE,
  max_uniq_to_pull = 10,
  range = TRUE,
  precision = TRUE,
  scale = TRUE,
  na_ratio = TRUE,
  levels_ratio = FALSE,
  ...
)

opt_pull_integer(
  values = TRUE,
  max_uniq_to_pull = 10,
  range = TRUE,
  na_ratio = TRUE,
  levels_ratio = FALSE,
  ...
)

opt_pull_date(range = TRUE, na_ratio = TRUE, levels_ratio = FALSE, ...)

opt_pull_logical(na_ratio = TRUE, levels_ratio = FALSE, ...)

opt_pull_table(nrows = "exact", ...)
opt_pull_character(
  values = TRUE,
  max_uniq_to_pull = 10,
  nchar = TRUE,
  na_ratio = TRUE,
  levels_ratio = TRUE,
  ...
)

opt_pull_numeric(
  values = TRUE,
  max_uniq_to_pull = 10,
  range = TRUE,
  precision = TRUE,
  scale = TRUE,
  na_ratio = TRUE,
  levels_ratio = FALSE,
  ...
)

opt_pull_integer(
  values = TRUE,
  max_uniq_to_pull = 10,
  range = TRUE,
  na_ratio = TRUE,
  levels_ratio = FALSE,
  ...
)

opt_pull_date(range = TRUE, na_ratio = TRUE, levels_ratio = FALSE, ...)

opt_pull_logical(na_ratio = TRUE, levels_ratio = FALSE, ...)

opt_pull_table(nrows = "exact", ...)

Arguments

`values`	Should column unique values be sourced? If so the ones are stored as an array withing `values` parameter.
`max_uniq_to_pull`	Pull unique values only when the distinct number of them is less than provided value. The parameter prevents for sourcing large amount of values to configuration file for example when dealing with ids column.
`nchar`	Should maximum number of characters in column be pulled? Is so stored as `nchar` parameter in configuration YAML file.
`na_ratio`	Should ratio of NA values existing in column be sourced?
`levels_ratio`	Should ratio of unique column values be sourced?
`...`	Other parameters defining column metadata source. Currently unsupported.
`range`	Should column range be sourced? Is so stored as `range` parameter in configuration YAML file.
`precision`	Currently unused.
`scale`	Currently unused.
`nrows`	Should number of original columns be sourced? When 'exact' stored as a `nrows` parameter for each table in YAML configuration file. When 'ratio' stored as a fraction of original columns (based on all tables) and saved as `nrows` configuration parameter. When 'none' tables rows information will not be saved.

Set of functions defining special simulation methods for column and its type

Description

Whenever there's a need to simulate column using specific function (as a spec parameter in YAML configuration file), such method should be defined in one of opt_simul_spec_<column_type> functions.

Usage

opt_simul_spec_character(name = simul_spec_character_name, ...)

opt_simul_spec_numeric(distr = simul_spec_numeric_distr, ...)

opt_simul_spec_integer(distr = simul_spec_integer_distr, ...)

opt_simul_spec_logical(distr = simul_spec_logical_distr, ...)

opt_simul_spec_date(distr = simul_spec_date_distr, ...)
opt_simul_spec_character(name = simul_spec_character_name, ...)

opt_simul_spec_numeric(distr = simul_spec_numeric_distr, ...)

opt_simul_spec_integer(distr = simul_spec_integer_distr, ...)

opt_simul_spec_logical(distr = simul_spec_logical_distr, ...)

opt_simul_spec_date(distr = simul_spec_date_distr, ...)

Arguments

`name`	Function for simulating personal names.
`...`	Other custom special methods.
`distr`	Function for simulating data from desired distribution.

Details

Currently defined special methods are:

name For character column, that allows to simulate character reflecting real names and surnames
distr For all the remaining column types. The method allows to simulate data with specified distribution generator, such as rnorm, rbinom etc.

Each 'spec' method receives n parameter (the desired number of rows to simulate), all the default column-based parameters (type, unique, not_null, etc.) but also a special one named spec_params that are applied to selected distribution simulation method.

See for example simul_spec_character_name definition.

Package 'DataFakeR'

Help Index

Setup default column type parameters

Description

Usage

Arguments

Default options for pulling metadata and data simulation

Description

Usage

Arguments

Format

Details

Methods for extracting number of target rows in simulation

Description

Usage

Arguments

Details

Configure data simulation options

Description

Usage

Arguments

Simulate data restricted by extra column parameters

Description

Usage

Arguments

Details

Modify sample with desired condition

Description

Usage

Arguments

Details

Examples

Schema object methods

Description

Usage

Arguments

Details

Source schema file into dependency graph object

Description

Usage

Arguments

Details

Character type simulation methods

Description

Usage

Arguments

Date type simulation methods

Description

Usage

Arguments

Integer type simulation methods

Description

Usage

Arguments

Logical type simulation methods

Description

Usage

Arguments

Numeric type simulation methods

Description

Usage

Arguments

Specify YAML configuration options while pulling the schema from DB

Description

Usage

Arguments

Set of functions defining special simulation methods for column and its type

Description

Usage

Arguments

Details