Title: MultiDimensional Feature Selection
Description: Functions for MultiDimensional Feature Selection (MDFS): calculating multidimensional information gains, scoring variables, finding important variables, plotting selection results. This package includes an optional CUDA implementation that speeds up information gain calculation using NVIDIA GPGPUs. R. Piliszek et al. (2019) <doi:10.32614/RJ-2019-019>.
Authors: Radosław Piliszek [aut, cre], Krzysztof Mnich [aut], Paweł Tabaszewski [aut], Szymon Migacz [aut], Andrzej Sułecki [aut], Witold Remigiusz Rudnicki [aut]
Maintainer: Radosław Piliszek <[email protected]>
License: GPL-3
Version: 1.5.3
Built: 2024-11-05 07:14:34 UTC
Source: https://github.com/cranhaven/cranhaven.r-universe.dev
This function is deprecated. Please use GenContrastVariables instead.
AddContrastVariables(data, n.contrast = max(ncol(data)/10, 30))
data | data organized in a matrix with separate variables in columns |
n.contrast | number of contrast variables (defaults to the greater of 1/10 of the number of variables and 30) |
A list with the following key names:
indices – vector of indices of input variables used to construct the contrast variables
x – data with the contrast variables appended to it
mask – vector of booleans making it easy to select just the contrast variables
as.data.frame S3 method implementation for MDFS
## S3 method for class 'MDFS'
as.data.frame(x, ...)
x | an MDFS object |
... | ignored |
A data.frame.
Interesting tuples
ComputeInterestingTuples(
  data,
  decision = NULL,
  dimensions = 2,
  divisions = 1,
  discretizations = 1,
  seed = NULL,
  range = NULL,
  pc.xi = 0.25,
  ig.thr = 0,
  I.lower = NULL,
  interesting.vars = vector(mode = "integer"),
  require.all.vars = FALSE,
  return.matrix = FALSE,
  stat_mode = "MI",
  average = FALSE
)
data | input data where columns are variables and rows are observations (all numeric) |
decision | decision variable as a binary sequence of length equal to the number of observations |
dimensions | number of dimensions (a positive integer; 5 max) |
divisions | number of divisions (from 1 to 15) |
discretizations | number of discretizations |
seed | seed for the PRNG used during discretizations |
range | discretization range (from 0.0 to 1.0) |
pc.xi | parameter xi used to compute pseudocounts (the default is recommended not to be changed) |
ig.thr | IG threshold above which a tuple is interesting (0 and negative mean no filtering) |
I.lower | IG values computed for the lower dimension (1D for 2D, etc.) |
interesting.vars | variables for which to check the IGs (none = all) |
require.all.vars | boolean whether to require a tuple to consist of only interesting.vars |
return.matrix | boolean whether to return a matrix instead of a list (ignored if not using the optimised method variant) |
stat_mode | character, one of: "MI" (mutual information, the default; becomes information gain when decision is given) |
average | boolean whether to average over discretisations instead of maximising (the default) |
If running in 2D and no filtering is applied, this function runs in an optimised fashion. It is recommended to avoid filtering in 2D whenever feasible.
This function calculates what stat_mode dictates. When decision is omitted, the stat_mode is calculated on the descriptive variables. When decision is given, the stat_mode is calculated on the decision variable, conditional on the other variables. Translate "IG" to that value in the rest of this function's description.
A data.frame or NULL (following a warning) if no tuples are found.
The following columns are present in the data.frame:
Var – interesting variable index
Tuple.1, Tuple.2, ... – corresponding tuple (up to dimensions columns)
IG – information gain achieved by Var in Tuple.*
Additionally, an attribute named run.params with the run parameters is set on the result.
ig.1d <- ComputeMaxInfoGains(madelon$data, madelon$decision, dimensions = 1,
                             divisions = 1, range = 0, seed = 0)
ComputeInterestingTuples(madelon$data, madelon$decision, dimensions = 2,
                         divisions = 1, range = 0, seed = 0,
                         ig.thr = 100, I.lower = ig.1d$IG)
Interesting tuples (discrete)
ComputeInterestingTuplesDiscrete(
  data,
  decision = NULL,
  dimensions = 2,
  pc.xi = 0.25,
  ig.thr = 0,
  I.lower = NULL,
  interesting.vars = vector(mode = "integer"),
  require.all.vars = FALSE,
  return.matrix = FALSE,
  stat_mode = "MI"
)
data | input data where columns are variables and rows are observations (all discrete with the same number of categories) |
decision | decision variable as a binary sequence of length equal to the number of observations |
dimensions | number of dimensions (a positive integer; 5 max) |
pc.xi | parameter xi used to compute pseudocounts (the default is recommended not to be changed) |
ig.thr | IG threshold above which a tuple is interesting (0 and negative mean no filtering) |
I.lower | IG values computed for the lower dimension (1D for 2D, etc.) |
interesting.vars | variables for which to check the IGs (none = all) |
require.all.vars | boolean whether to require a tuple to consist of only interesting.vars |
return.matrix | boolean whether to return a matrix instead of a list (ignored if not using the optimised method variant) |
stat_mode | character, one of: "MI" (mutual information, the default; becomes information gain when decision is given) |
If running in 2D and no filtering is applied, this function runs in an optimised fashion. It is recommended to avoid filtering in 2D whenever feasible.
This function calculates what stat_mode dictates. When decision is omitted, the stat_mode is calculated on the descriptive variables. When decision is given, the stat_mode is calculated on the decision variable, conditional on the other variables. Translate "IG" to that value in the rest of this function's description.
A data.frame or NULL (following a warning) if no tuples are found.
The following columns are present in the data.frame:
Var – interesting variable index
Tuple.1, Tuple.2, ... – corresponding tuple (up to dimensions columns)
IG – information gain achieved by Var in Tuple.*
Additionally, an attribute named run.params with the run parameters is set on the result.
ig.1d <- ComputeMaxInfoGainsDiscrete(madelon$data > 500, madelon$decision,
                                     dimensions = 1)
ComputeInterestingTuplesDiscrete(madelon$data > 500, madelon$decision,
                                 dimensions = 2, ig.thr = 100,
                                 I.lower = ig.1d$IG)
Max information gains
ComputeMaxInfoGains(
  data,
  decision,
  contrast_data = NULL,
  dimensions = 1,
  divisions = 1,
  discretizations = 1,
  seed = NULL,
  range = NULL,
  pc.xi = 0.25,
  return.tuples = FALSE,
  interesting.vars = vector(mode = "integer"),
  require.all.vars = FALSE,
  use.CUDA = FALSE
)
data | input data where columns are variables and rows are observations (all numeric) |
decision | decision variable as a binary sequence of length equal to the number of observations |
contrast_data | the contrast counterpart of data; must have the same number of observations - not supported with CUDA |
dimensions | number of dimensions (a positive integer; 5 max) |
divisions | number of divisions (from 1 to 15; additionally limited by dimensions if using CUDA) |
discretizations | number of discretizations |
seed | seed for the PRNG used during discretizations |
range | discretization range (from 0.0 to 1.0) |
pc.xi | parameter xi used to compute pseudocounts (the default is recommended not to be changed) |
return.tuples | whether to return tuples (and the relevant discretization number) where max IG was observed (one tuple and discretization number per variable) - not supported with CUDA nor in 1D |
interesting.vars | variables for which to check the IGs (none = all) - not supported with CUDA |
require.all.vars | boolean whether to require a tuple to consist of only interesting.vars |
use.CUDA | whether to use CUDA acceleration (must be compiled with CUDA) |
A data.frame with the following columns:
IG – max information gain (of each variable)
Tuple.1, Tuple.2, ... – corresponding tuple (up to dimensions columns, available only when return.tuples == TRUE)
Discretization.nr – corresponding discretization number (available only when return.tuples == TRUE)
Additionally, an attribute named run.params with the run parameters is set on the result.
ComputeMaxInfoGains(madelon$data, madelon$decision, dimensions = 2, divisions = 1, range = 0, seed = 0)
Max information gains (discrete)
ComputeMaxInfoGainsDiscrete(
  data,
  decision,
  contrast_data = NULL,
  dimensions = 1,
  pc.xi = 0.25,
  return.tuples = FALSE,
  interesting.vars = vector(mode = "integer"),
  require.all.vars = FALSE
)
data | input data where columns are variables and rows are observations (all discrete with the same number of categories) |
decision | decision variable as a binary sequence of length equal to the number of observations |
contrast_data | the contrast counterpart of data; must have the same number of observations |
dimensions | number of dimensions (a positive integer; 5 max) |
pc.xi | parameter xi used to compute pseudocounts (the default is recommended not to be changed) |
return.tuples | whether to return tuples where max IG was observed (one tuple per variable) - not supported in 1D |
interesting.vars | variables for which to check the IGs (none = all) |
require.all.vars | boolean whether to require a tuple to consist of only interesting.vars |
A data.frame with the following columns:
IG – max information gain (of each variable)
Tuple.1, Tuple.2, ... – corresponding tuple (up to dimensions columns, available only when return.tuples == TRUE)
Discretization.nr – always 1 (for compatibility with the non-discrete function; available only when return.tuples == TRUE)
Additionally, an attribute named run.params with the run parameters is set on the result.
ComputeMaxInfoGainsDiscrete(madelon$data > 500, madelon$decision, dimensions = 2)
Compute p-values from information gains and return MDFS
ComputePValue(
  IG,
  dimensions,
  divisions,
  response.divisions = 1,
  df = NULL,
  contrast.mask = NULL,
  ig.in.bits = TRUE,
  ig.doubled = FALSE,
  one.dim.mode = "exp",
  irr.vars.num = NULL,
  ign.low.ig.vars.num = NULL,
  min.irr.vars.num = NULL,
  max.ign.low.ig.vars.num = NULL,
  search.points = 8,
  level = 0.05
)
IG | max conditional information gains |
dimensions | number of dimensions |
divisions | number of divisions |
response.divisions | number of response divisions (i.e. number of categories minus 1) |
df | vector of degrees of freedom for each variable (optional) |
contrast.mask | boolean mask on IG marking the contrast variables |
ig.in.bits | whether the input IG values are expressed in bits |
ig.doubled | whether the input IG values are doubled |
one.dim.mode | mode of operation in one dimension (the default is "exp") |
irr.vars.num | if not NULL, number of irrelevant variables, specified by the user |
ign.low.ig.vars.num | if not NULL, number of ignored low-IG variables, specified by the user |
min.irr.vars.num | minimum number of irrelevant variables |
max.ign.low.ig.vars.num | maximum number of ignored low-IG variables |
search.points | number of points in the search procedure for the optimal number of ignored variables |
level | acceptable error level of the goodness-of-fit one-sample Kolmogorov-Smirnov test (used only for a warning) |
A data.frame with class set to MDFS. It can be coerced back to a plain data.frame using as.data.frame.
The following columns are present:
IG – information gains (input copy)
chi.squared.p.value – chi-squared p-values
p.value – theoretical p-values
Additionally, the following attributes are set:
run.params – run parameters
sq.dev – vector of square deviations used to estimate the number of irrelevant variables
dist.param – distribution parameter
err.param – squared error of the distribution parameter
fit.p.value – p-value of fit
ComputePValue(madelon$IG.2D, dimensions = 2, divisions = 1)
Discretize variable on demand
Discretize(data, variable.idx, divisions, discretization.nr, seed, range)
data | input data where columns are variables and rows are observations (all numeric) |
variable.idx | variable index (as it appears in data) |
divisions | number of divisions |
discretization.nr | discretization number (a positive integer) |
seed | seed for the PRNG |
range | discretization range |
Discretized variable.
Discretize(madelon$data, 3, 1, 1, 0, 0.5)
Generate contrast variables from data
GenContrastVariables(data, n.contrast = max(ncol(data)/10, 30))
data | data organized in a matrix with separate variables in columns |
n.contrast | number of contrast variables (defaults to the greater of 1/10 of the number of variables and 30) |
A list with the following key names:
indices – vector of indices of input variables used to construct the contrast variables
x – data with the contrast variables appended to it
mask – vector of booleans making it easy to select just the contrast variables
GenContrastVariables(madelon$data)
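The idea behind contrast variables can be sketched in a few lines of base R: each contrast variable is an existing column with its rows independently permuted, which destroys any relation to the decision while preserving the marginal distribution. This is a simplified illustration of the technique, not the package's implementation; all variable names below are hypothetical.

```r
# Simplified sketch of contrast-variable generation (not the package's code).
set.seed(0)
data <- matrix(rnorm(50), nrow = 10)   # 10 observations, 5 variables
n.contrast <- 3                        # hypothetical count of contrast variables
indices <- sample(ncol(data), n.contrast, replace = TRUE)
# Permute each selected column independently to break any real signal:
contrast <- apply(data[, indices, drop = FALSE], 2, sample)
x <- cbind(data, contrast)             # original data with contrasts appended
mask <- c(rep(FALSE, ncol(data)), rep(TRUE, n.contrast))  # TRUE marks contrasts
```

Relevance thresholds can then be calibrated against the scores obtained by the contrast (known-irrelevant) columns.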
Get the recommended range for multiple discretisations
GetRange(k = 3, n, dimensions, divisions = 1)
k | the assumed minimum number of objects in a bucket (the default is the recommended value) |
n | the total number of objects considered |
dimensions | the number of dimensions of analysis |
divisions | the number of divisions of discretisations |
The recommended range value (a floating point number).
GetRange(n = 250, dimensions = 2)
An artificial dataset containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labelled 0/1.
madelon
A list of three elements:
data – 2000 by 500 matrix of 2000 objects with 500 features
decision – vector of 2000 decisions (labels 0/1)
IG.2D – example 2D IG computed using ComputeMaxInfoGains
The five dimensions constitute 5 informative features. 15 linear combinations of those features are added to form a set of 20 (redundant) informative features. The remaining 480 features are distractors, called 'probes', with no predictive power.
Included is the original training set with label -1 changed to 0.
https://archive.ics.uci.edu/ml/datasets/Madelon
Run end-to-end MDFS
MDFS(
  data,
  decision,
  n.contrast = max(ncol(data)/10, 30),
  dimensions = 1,
  divisions = 1,
  discretizations = 1,
  range = NULL,
  pc.xi = 0.25,
  p.adjust.method = "holm",
  level = 0.05,
  seed = NULL,
  use.CUDA = FALSE
)
data | input data where columns are variables and rows are observations (all numeric) |
decision | decision variable as a boolean vector of length equal to the number of observations |
n.contrast | number of contrast variables (defaults to the greater of 1/10 of the number of variables and 30) |
dimensions | number of dimensions (a positive integer; on CUDA limited to the 2–5 range) |
divisions | number of divisions (from 1 to 15) |
discretizations | number of discretizations |
range | discretization range (from 0.0 to 1.0) |
pc.xi | parameter xi used to compute pseudocounts (the default is recommended not to be changed) |
p.adjust.method | method as accepted by p.adjust |
level | statistical significance level |
seed | seed for the PRNG used during discretizations |
use.CUDA | whether to use CUDA acceleration (must be compiled with CUDA; NOTE: the CUDA version might provide a slightly lower sensitivity) |
In case of FDR control it is recommended to use the Benjamini-Hochberg-Yekutieli p-value adjustment method ("BY" in p.adjust) due to unknown dependencies between tests.
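For illustration, the default "holm" adjustment and the recommended "BY" adjustment can be compared directly with base R's p.adjust; the p-values below are made up:

```r
# Hypothetical raw p-values, purely for illustration
pvals <- c(0.001, 0.012, 0.030, 0.200, 0.700)
p.adjust(pvals, method = "holm")  # family-wise error rate control (the package default)
p.adjust(pvals, method = "BY")    # FDR control valid under arbitrary dependence
```

"BY" inflates the "BH" adjustment by a factor that grows with the number of tests, which is what makes it safe when the dependence structure between tests is unknown.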
A list with the following fields:
contrast.indices – indices of variables chosen to build contrast variables
contrast.variables – built contrast variables
MIG.Result – result of ComputeMaxInfoGains
MDFS – result of ComputePValue (the MDFS object)
statistic – vector of the statistic's values (IGs) for the corresponding variables
p.value – vector of p-values for the corresponding variables
adjusted.p.value – vector of adjusted p-values for the corresponding variables
relevant.variables – vector of relevant variable indices
MDFS(madelon$data, madelon$decision, dimensions = 2, divisions = 1, range = 0, seed = 0)
Call omp_set_num_threads
mdfs_omp_set_num_threads(num_threads)
num_threads | number of OpenMP threads to use |
Plot MDFS details
## S3 method for class 'MDFS'
plot(x, plots = c("ig", "c", "p"), ...)
x | an MDFS object |
plots | plots to plot ("ig" for max IG, "c" for chi-squared p-values, "p" for p-values) |
... | passed on to plot |
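A minimal usage sketch (assumes the MDFS package is installed; it builds an MDFS object from the bundled madelon example data, as in the ComputePValue example, and then plots it):

```r
library(MDFS)

# ComputePValue returns an object of class MDFS, which plot() dispatches on
res <- ComputePValue(madelon$IG.2D, dimensions = 2, divisions = 1)
plot(res, plots = c("ig", "p"))  # max-IG panel and p-value panel
```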
Find indices of relevant variables
RelevantVariables(fs, ...)
fs | feature selector |
... | arguments passed to methods |
indices of important variables
Find indices of relevant variables from MDFS
## S3 method for class 'MDFS'
RelevantVariables(fs, level = 0.05, p.adjust.method = "holm", ...)
fs | an MDFS object |
level | statistical significance level |
p.adjust.method | method as accepted by p.adjust |
... | ignored |
In case of FDR control it is recommended to use the Benjamini-Hochberg-Yekutieli p-value adjustment method ("BY" in p.adjust) due to unknown dependencies between tests.
indices of relevant variables
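Putting it together, a sketch assuming the MDFS package is installed (it mirrors the package's own ComputePValue example and then extracts the relevant variables with the recommended "BY" adjustment):

```r
library(MDFS)

# Build an MDFS object from the bundled example 2D information gains
res <- ComputePValue(madelon$IG.2D, dimensions = 2, divisions = 1)

# Indices of variables deemed relevant at the 0.05 level under FDR control
RelevantVariables(res, level = 0.05, p.adjust.method = "BY")
```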