Title: MultiDimensional Feature Selection
Description: Functions for MultiDimensional Feature Selection (MDFS): calculating multidimensional information gains, scoring variables, finding important variables, plotting selection results. This package includes an optional CUDA implementation that speeds up information gain calculation using NVIDIA GPGPUs. R. Piliszek et al. (2019) <doi:10.32614/RJ-2019-019>.
Authors: Radosław Piliszek [aut, cre], Krzysztof Mnich [aut], Paweł Tabaszewski [aut], Szymon Migacz [aut], Andrzej Sułecki [aut], Witold Remigiusz Rudnicki [aut]
Maintainer: Radosław Piliszek <[email protected]>
License: GPL-3
Version: 1.5.3
Built: 2024-11-05 07:14:34 UTC
Source: https://github.com/cranhaven/cranhaven.r-universe.dev
This function is deprecated. Please use GenContrastVariables instead.
AddContrastVariables(data, n.contrast = max(ncol(data)/10, 30))
data | data organized in a matrix with separate variables in columns |
n.contrast | number of contrast variables (defaults to the greater of 1/10 of the number of variables and 30) |
A list with the following key names:
indices – vector of indices of input variables used to construct the contrast variables
x – data with the contrast variables appended to it
mask – vector of booleans making it easy to select just the contrast variables
as.data.frame S3 method implementation for MDFS
## S3 method for class 'MDFS'
as.data.frame(x, ...)
x | an MDFS object |
... | ignored |
A data.frame.
Interesting tuples
ComputeInterestingTuples(
  data,
  decision = NULL,
  dimensions = 2,
  divisions = 1,
  discretizations = 1,
  seed = NULL,
  range = NULL,
  pc.xi = 0.25,
  ig.thr = 0,
  I.lower = NULL,
  interesting.vars = vector(mode = "integer"),
  require.all.vars = FALSE,
  return.matrix = FALSE,
  stat_mode = "MI",
  average = FALSE
)
data | input data where columns are variables and rows are observations (all numeric) |
decision | decision variable as a binary sequence of length equal to the number of observations |
dimensions | number of dimensions (a positive integer; 5 max) |
divisions | number of divisions (from 1 to 15) |
discretizations | number of discretizations |
seed | seed for the PRNG used during discretizations |
range | discretization range (from 0.0 to 1.0) |
pc.xi | parameter xi used to compute pseudocounts (the default is recommended not to be changed) |
ig.thr | IG threshold above which a tuple is interesting (0 and negative mean no filtering) |
I.lower | IG values computed for the lower dimension (1D for 2D, etc.) |
interesting.vars | variables for which to check the IGs (none = all) |
require.all.vars | boolean whether to require a tuple to consist of only interesting.vars |
return.matrix | boolean whether to return a matrix instead of a list (ignored if not using the optimised method variant) |
stat_mode | character, one of: "MI" (mutual information, the default; becomes information gain when decision is given) |
average | boolean whether to average over discretisations instead of maximising (the default) |
If running in 2D and no filtering is applied, this function runs in an optimised fashion. It is recommended to avoid filtering in 2D whenever feasible.
This function calculates what stat_mode dictates. When decision is omitted, the stat_mode is calculated on the descriptive variables. When decision is given, the stat_mode is calculated on the decision variable, conditional on the other variables. Translate "IG" to that value in the rest of this function's description.
A data.frame or NULL (following a warning) if no tuples are found.
The following columns are present in the data.frame:
Var – interesting variable index
Tuple.1, Tuple.2, ... – corresponding tuple (up to dimensions columns)
IG – information gain achieved by Var in Tuple.*
Additionally, an attribute named run.params with the run parameters is set on the result.
ig.1d <- ComputeMaxInfoGains(madelon$data, madelon$decision, dimensions = 1,
                             divisions = 1, range = 0, seed = 0)
ComputeInterestingTuples(madelon$data, madelon$decision, dimensions = 2,
                         divisions = 1, range = 0, seed = 0,
                         ig.thr = 100, I.lower = ig.1d$IG)
Interesting tuples (discrete)
ComputeInterestingTuplesDiscrete(
  data,
  decision = NULL,
  dimensions = 2,
  pc.xi = 0.25,
  ig.thr = 0,
  I.lower = NULL,
  interesting.vars = vector(mode = "integer"),
  require.all.vars = FALSE,
  return.matrix = FALSE,
  stat_mode = "MI"
)
data | input data where columns are variables and rows are observations (all discrete with the same number of categories) |
decision | decision variable as a binary sequence of length equal to the number of observations |
dimensions | number of dimensions (a positive integer; 5 max) |
pc.xi | parameter xi used to compute pseudocounts (the default is recommended not to be changed) |
ig.thr | IG threshold above which a tuple is interesting (0 and negative mean no filtering) |
I.lower | IG values computed for the lower dimension (1D for 2D, etc.) |
interesting.vars | variables for which to check the IGs (none = all) |
require.all.vars | boolean whether to require a tuple to consist of only interesting.vars |
return.matrix | boolean whether to return a matrix instead of a list (ignored if not using the optimised method variant) |
stat_mode | character, one of: "MI" (mutual information, the default; becomes information gain when decision is given) |
If running in 2D and no filtering is applied, this function runs in an optimised fashion. It is recommended to avoid filtering in 2D whenever feasible.
This function calculates what stat_mode dictates. When decision is omitted, the stat_mode is calculated on the descriptive variables. When decision is given, the stat_mode is calculated on the decision variable, conditional on the other variables. Translate "IG" to that value in the rest of this function's description.
A data.frame or NULL (following a warning) if no tuples are found.
The following columns are present in the data.frame:
Var – interesting variable index
Tuple.1, Tuple.2, ... – corresponding tuple (up to dimensions columns)
IG – information gain achieved by Var in Tuple.*
Additionally, an attribute named run.params with the run parameters is set on the result.
ig.1d <- ComputeMaxInfoGainsDiscrete(madelon$data > 500, madelon$decision,
                                     dimensions = 1)
ComputeInterestingTuplesDiscrete(madelon$data > 500, madelon$decision,
                                 dimensions = 2, ig.thr = 100,
                                 I.lower = ig.1d$IG)
Max information gains
ComputeMaxInfoGains(
  data,
  decision,
  contrast_data = NULL,
  dimensions = 1,
  divisions = 1,
  discretizations = 1,
  seed = NULL,
  range = NULL,
  pc.xi = 0.25,
  return.tuples = FALSE,
  interesting.vars = vector(mode = "integer"),
  require.all.vars = FALSE,
  use.CUDA = FALSE
)
data | input data where columns are variables and rows are observations (all numeric) |
decision | decision variable as a binary sequence of length equal to the number of observations |
contrast_data | the contrast counterpart of data; must have the same number of observations - not supported with CUDA |
dimensions | number of dimensions (a positive integer; 5 max) |
divisions | number of divisions (from 1 to 15; additionally limited by dimensions if using CUDA) |
discretizations | number of discretizations |
seed | seed for the PRNG used during discretizations |
range | discretization range (from 0.0 to 1.0) |
pc.xi | parameter xi used to compute pseudocounts (the default is recommended not to be changed) |
return.tuples | whether to return tuples (and the relevant discretization number) where max IG was observed (one tuple and discretization number per variable) - not supported with CUDA nor in 1D |
interesting.vars | variables for which to check the IGs (none = all) - not supported with CUDA |
require.all.vars | boolean whether to require a tuple to consist of only interesting.vars |
use.CUDA | whether to use CUDA acceleration (must be compiled with CUDA) |
A data.frame with the following columns:
IG – max information gain (of each variable)
Tuple.1, Tuple.2, ... – corresponding tuple (up to dimensions columns, available only when return.tuples == TRUE)
Discretization.nr – corresponding discretization number (available only when return.tuples == TRUE)
Additionally, an attribute named run.params with the run parameters is set on the result.
ComputeMaxInfoGains(madelon$data, madelon$decision, dimensions = 2, divisions = 1, range = 0, seed = 0)
Max information gains (discrete)
ComputeMaxInfoGainsDiscrete(
  data,
  decision,
  contrast_data = NULL,
  dimensions = 1,
  pc.xi = 0.25,
  return.tuples = FALSE,
  interesting.vars = vector(mode = "integer"),
  require.all.vars = FALSE
)
data | input data where columns are variables and rows are observations (all discrete with the same number of categories) |
decision | decision variable as a binary sequence of length equal to the number of observations |
contrast_data | the contrast counterpart of data; must have the same number of observations |
dimensions | number of dimensions (a positive integer; 5 max) |
pc.xi | parameter xi used to compute pseudocounts (the default is recommended not to be changed) |
return.tuples | whether to return tuples where max IG was observed (one tuple per variable) - not supported in 1D |
interesting.vars | variables for which to check the IGs (none = all) |
require.all.vars | boolean whether to require a tuple to consist of only interesting.vars |
A data.frame with the following columns:
IG – max information gain (of each variable)
Tuple.1, Tuple.2, ... – corresponding tuple (up to dimensions columns, available only when return.tuples == TRUE)
Discretization.nr – always 1 (for compatibility with the non-discrete function; available only when return.tuples == TRUE)
Additionally, an attribute named run.params with the run parameters is set on the result.
ComputeMaxInfoGainsDiscrete(madelon$data > 500, madelon$decision, dimensions = 2)
Compute p-values from information gains and return MDFS
ComputePValue(
  IG,
  dimensions,
  divisions,
  response.divisions = 1,
  df = NULL,
  contrast.mask = NULL,
  ig.in.bits = TRUE,
  ig.doubled = FALSE,
  one.dim.mode = "exp",
  irr.vars.num = NULL,
  ign.low.ig.vars.num = NULL,
  min.irr.vars.num = NULL,
  max.ign.low.ig.vars.num = NULL,
  search.points = 8,
  level = 0.05
)
IG | max conditional information gains |
dimensions | number of dimensions |
divisions | number of divisions |
response.divisions | number of response divisions (i.e. number of categories minus 1) |
df | vector of degrees of freedom for each variable (optional) |
contrast.mask | boolean mask on IG marking the contrast variables |
ig.in.bits | whether the input IG values are expressed in bits |
ig.doubled | whether the input IG values are doubled |
one.dim.mode | mode of operation in one dimension (the default is "exp") |
irr.vars.num | if not NULL, number of irrelevant variables, specified by the user |
ign.low.ig.vars.num | if not NULL, number of ignored low-IG variables, specified by the user |
min.irr.vars.num | minimum number of irrelevant variables |
max.ign.low.ig.vars.num | maximum number of ignored low-IG variables |
search.points | number of points in the search procedure for the optimal number of ignored variables |
level | acceptable error level of the goodness-of-fit one-sample Kolmogorov-Smirnov test (used only for a warning) |
A data.frame with class set to MDFS. It can be coerced back to a plain data.frame using as.data.frame.
The following columns are present:
IG – information gains (input copy)
chi.squared.p.value – chi-squared p-values
p.value – theoretical p-values
Additionally, the following attributes are set:
run.params – run parameters
sq.dev – vector of square deviations used to estimate the number of irrelevant variables
dist.param – distribution parameter
err.param – squared error of the distribution parameter
fit.p.value – p-value of fit
ComputePValue(madelon$IG.2D, dimensions = 2, divisions = 1)
Discretize variable on demand
Discretize(data, variable.idx, divisions, discretization.nr, seed, range)
data | input data where columns are variables and rows are observations (all numeric) |
variable.idx | variable index (as it appears in data) |
divisions | number of divisions |
discretization.nr | discretization number (a positive integer) |
seed | seed for the PRNG |
range | discretization range |
Discretized variable.
Discretize(madelon$data, 3, 1, 1, 0, 0.5)
Generate contrast variables from data
GenContrastVariables(data, n.contrast = max(ncol(data)/10, 30))
data | data organized in a matrix with separate variables in columns |
n.contrast | number of contrast variables (defaults to the greater of 1/10 of the number of variables and 30) |
A list with the following key names:
indices – vector of indices of input variables used to construct the contrast variables
x – data with the contrast variables appended to it
mask – vector of booleans making it easy to select just the contrast variables
GenContrastVariables(madelon$data)
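The idea behind contrast variables can be sketched in a few lines of base R: each contrast variable is an existing column with its rows independently permuted, which destroys any relation to the decision while preserving the marginal distribution. This is a simplified illustration of the technique, not the package's implementation; all variable names below are hypothetical.

```r
# Simplified sketch of contrast-variable generation (not the package's code).
set.seed(0)
data <- matrix(rnorm(50), nrow = 10)   # 10 observations, 5 variables
n.contrast <- 3                        # hypothetical count of contrast variables
indices <- sample(ncol(data), n.contrast, replace = TRUE)
# Permute each selected column independently to break any real signal:
contrast <- apply(data[, indices, drop = FALSE], 2, sample)
x <- cbind(data, contrast)             # original data with contrasts appended
mask <- c(rep(FALSE, ncol(data)), rep(TRUE, n.contrast))  # TRUE marks contrasts
```

Relevance thresholds can then be calibrated against the scores obtained by the contrast (known-irrelevant) columns.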
Get the recommended range for multiple discretisations
GetRange(k = 3, n, dimensions, divisions = 1)
k | the assumed minimum number of objects in a bucket (the default is the recommended value) |
n | the total number of objects considered |
dimensions | the number of dimensions of analysis |
divisions | the number of divisions of discretisations |
The recommended range value (a floating point number).
GetRange(n = 250, dimensions = 2)
An artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labelled 0/1.
madelon
A list of three elements:
data – 2000 by 500 matrix of 2000 objects with 500 features
decision – vector of 2000 decisions (labels 0/1)
IG.2D – example 2D IG computed using ComputeMaxInfoGains
The five dimensions constitute 5 informative features. 15 linear combinations of those features are added to form a set of 20 (redundant) informative features. The remaining 480 features are distractors, called 'probes', with no predictive power.
Included is the original training set with label -1 changed to 0.
https://archive.ics.uci.edu/ml/datasets/Madelon
Run end-to-end MDFS
MDFS(
  data,
  decision,
  n.contrast = max(ncol(data)/10, 30),
  dimensions = 1,
  divisions = 1,
  discretizations = 1,
  range = NULL,
  pc.xi = 0.25,
  p.adjust.method = "holm",
  level = 0.05,
  seed = NULL,
  use.CUDA = FALSE
)
data | input data where columns are variables and rows are observations (all numeric) |
decision | decision variable as a boolean vector of length equal to the number of observations |
n.contrast | number of contrast variables (defaults to the greater of 1/10 of the number of variables and 30) |
dimensions | number of dimensions (a positive integer; on CUDA limited to the 2–5 range) |
divisions | number of divisions (from 1 to 15) |
discretizations | number of discretizations |
range | discretization range (from 0.0 to 1.0) |
pc.xi | parameter xi used to compute pseudocounts (the default is recommended not to be changed) |
p.adjust.method | method as accepted by p.adjust |
level | statistical significance level |
seed | seed for the PRNG used during discretizations |
use.CUDA | whether to use CUDA acceleration (must be compiled with CUDA; NOTE: the CUDA version might provide a slightly lower sensitivity) |
In case of FDR control it is recommended to use the Benjamini-Hochberg-Yekutieli p-value adjustment method ("BY" in p.adjust) due to unknown dependencies between tests.
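For illustration, the default "holm" adjustment and the recommended "BY" adjustment can be compared directly with base R's p.adjust; the p-values below are made up:

```r
# Hypothetical raw p-values, purely for illustration
pvals <- c(0.001, 0.012, 0.030, 0.200, 0.700)
p.adjust(pvals, method = "holm")  # family-wise error rate control (the package default)
p.adjust(pvals, method = "BY")    # FDR control valid under arbitrary dependence
```

"BY" inflates the "BH" adjustment by a factor that grows with the number of tests, which is what makes it safe when the dependence structure between tests is unknown.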
A list with the following fields:
contrast.indices – indices of variables chosen to build contrast variables
contrast.variables – built contrast variables
MIG.Result – result of ComputeMaxInfoGains
MDFS – result of ComputePValue (the MDFS object)
statistic – vector of the statistic's values (IGs) for the corresponding variables
p.value – vector of p-values for the corresponding variables
adjusted.p.value – vector of adjusted p-values for the corresponding variables
relevant.variables – vector of relevant variable indices
MDFS(madelon$data, madelon$decision, dimensions = 2, divisions = 1, range = 0, seed = 0)
Call omp_set_num_threads
mdfs_omp_set_num_threads(num_threads)
num_threads | number of OpenMP threads to use |
Plot MDFS details
## S3 method for class 'MDFS'
plot(x, plots = c("ig", "c", "p"), ...)
x | an MDFS object |
plots | plots to plot ("ig" for max IG, "c" for chi-squared p-values, "p" for p-values) |
... | passed on to plot |
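A minimal usage sketch (assumes the MDFS package is installed; it builds an MDFS object from the bundled madelon example data, as in the ComputePValue example, and then plots it):

```r
library(MDFS)

# ComputePValue returns an object of class MDFS, which plot() dispatches on
res <- ComputePValue(madelon$IG.2D, dimensions = 2, divisions = 1)
plot(res, plots = c("ig", "p"))  # max-IG panel and p-value panel
```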
Find indices of relevant variables
RelevantVariables(fs, ...)
fs | feature selector |
... | arguments passed to methods |
indices of important variables
Find indices of relevant variables from MDFS
## S3 method for class 'MDFS'
RelevantVariables(fs, level = 0.05, p.adjust.method = "holm", ...)
fs | an MDFS object |
level | statistical significance level |
p.adjust.method | method as accepted by p.adjust |
... | ignored |
In case of FDR control it is recommended to use the Benjamini-Hochberg-Yekutieli p-value adjustment method ("BY" in p.adjust) due to unknown dependencies between tests.
indices of relevant variables
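Putting it together, a sketch assuming the MDFS package is installed (it mirrors the package's own ComputePValue example and then extracts the relevant variables with the recommended "BY" adjustment):

```r
library(MDFS)

# Build an MDFS object from the bundled example 2D information gains
res <- ComputePValue(madelon$IG.2D, dimensions = 2, divisions = 1)

# Indices of variables deemed relevant at the 0.05 level under FDR control
RelevantVariables(res, level = 0.05, p.adjust.method = "BY")
```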