Introduction to iClusterVB

iClusterVB

iClusterVB allows for fast integrative clustering and feature selection for high dimensional data.

Using a variational Bayes approach, its key features - clustering of mixed-type data, automated determination of the number of clusters, and feature selection in high-dimensional settings - address the limitations of traditional clustering methods while offering an alternative and potentially faster approach than MCMC algorithms, making iClusterVB a valuable tool for contemporary data analysis challenges.

There is a simulated dataset included as a list in the package that we can use to illustrate iClusterVB.

Data pre-processing

library(iClusterVB)

# sim_data comes with the iClusterVB package.
dat1 <- list(
  gauss_1 = sim_data$continuous1_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  gauss_2 = sim_data$continuous2_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  poisson_1 = sim_data$count_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  multinomial_1 = sim_data$binary_data[c(1:20, 61:80, 121:140, 181:200), 1:75]
)

# We re-code `0`s to `2`s

dat1$multinomial_1[dat1$multinomial_1 == 0] <- 2

dist <- c(
  "gaussian", "gaussian",
  "poisson", "multinomial"
)

Running the model

fit_iClusterVB <- iClusterVB(
  mydata = dat1,
  dist = dist,
  K = 4,
  initial_method = "VarSelLCM",
  VS_method = 1,
  max_iter = 50
)
#> ------------------------------------------------------------
#> Pre-processing and initializing the model
#> ------------------------------------------------------------
#> 
#> ------------------------------------------------------------
#> Running the CAVI algorithm
#> ------------------------------------------------------------
#> iteration = 10 elbo = -1987757.206577  
#> iteration = 20 elbo = -1937584.919018  
#> iteration = 30 elbo = -1885755.424734  
#> iteration = 40 elbo = -1852764.123269  
#> iteration = 50 elbo = -1837257.997981

Summary of the Model

# We can obtain a summary using summary()
summary(fit_iClusterVB)
#> Total number of individuals:
#> [1] 80
#> 
#> User-inputted maximum number of clusters: 4
#> Number of clusters determined by algorithm: 4
#> 
#> Cluster Membership:
#>  1  2  3  4 
#> 20 20 20 20 
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 1 - gaussian
#> [1] "50 out of a total of 75"
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 2 - gaussian
#> [1] "51 out of a total of 75"
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 3 - poisson
#> [1] "52 out of a total of 75"
#> 
#> # of variables above the posterior inclusion probability of 0.5 for View 4 - multinomial
#> [1] "51 out of a total of 75"

Generic Plots

plot(fit_iClusterVB)

Probability of Inclusion Plots

# The `piplot` function can be used to visualize the probability of inclusion

piplot(fit_iClusterVB)

Heat maps to visualize the clusters

# The `chmap` function can be used to display heat maps for each data view

chmap(fit_iClusterVB, rho = 0,
      cols = c("green", "blue",
               "purple", "red"),
      scale = "none")