Title: Distance & Density-Based Outlier Detection
Description: Outlier detection in multidimensional domains. Implementation of notable distance and density-based outlier algorithms. Allows users to identify local outliers by comparing observations to their nearest neighbors, reverse nearest neighbors, shared neighbors or natural neighbors. For distance-based approaches, see Knorr, M., & Ng, R. T. (1997) <doi:10.1145/782010.782021>, Angiulli, F., & Pizzuti, C. (2002) <doi:10.1007/3-540-45681-3_2>, Hautamaki, V., & Karkkainen, I. (2004) <doi:10.1109/ICPR.2004.1334558> and Zhang, K., Hutter, M. & Jin, H. (2009) <doi:10.1007/978-3-642-01307-2_84>. For density-based approaches, see Tang, J., Chen, Z., Fu, A. W. C., & Cheung, D. W. (2002) <doi:10.1007/3-540-47887-6_53>, Jin, W., Tung, A. K. H., Han, J., & Wang, W. (2006) <doi:10.1007/11731139_68>, Schubert, E., Zimek, A. & Kriegel, H.-P. (2014) <doi:10.1137/1.9781611973440.63>, Latecki, L., Lazarevic, A. & Pokrajac, D. (2007) <doi:10.1007/978-3-540-73499-4_6>, Papadimitriou, S., Gibbons, P. B., & Faloutsos, C. (2003) <doi:10.1109/ICDE.2003.1260802>, Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000) <doi:10.1145/342009.335388>, Kriegel, H.-P., Kröger, P., Schubert, E., & Zimek, A. (2009) <doi:10.1145/1645953.1646195>, Zhu, Q., Feng, J. & Huang, J. (2016) <doi:10.1016/j.patrec.2016.05.007>, Huang, J., Zhu, Q., Yang, L. & Feng, J. (2015) <doi:10.1016/j.knosys.2015.10.014>, Tang, B. & He, H. (2017) <doi:10.1016/j.neucom.2017.02.039> and Gao, J., Hu, W., Zhang, X. & Wu, O. (2011) <doi:10.1007/978-3-642-20847-8_23>.
Authors: Jacob H. Madsen <[email protected]>
Maintainer: Jacob H. Madsen <[email protected]>
License: MIT + file LICENSE
Version: 0.1.0
Built: 2025-02-17 12:37:44 UTC
Source: https://github.com/cranhaven/cranhaven.r-universe.dev
Function to calculate the connectivity-based outlier factor as an outlier score for observations. Suggested by Tang, J., Chen, Z., Fu, A. W. C., & Cheung, D. W. (2002)
COF(dataset, k = 5)
dataset: The dataset for which observations have a COF score returned.
k: The number of k-nearest neighbors used to construct the SBN-path, i.e. the number of neighbors whose chaining distances each observation is compared with. k has to be smaller than the number of observations in the dataset.
COF computes the connectivity-based outlier factor for observations by comparing the chaining distance of the observation being scored with the chaining distances of its neighboring observations. The COF function is useful for outlier detection in clustering and other multidimensional domains.
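For orientation, a hedged restatement of the score in Tang et al. (2002): writing ac-dist for the average chaining distance and N_k(p) for the k-nearest neighbor set of p, COF is the ratio of p's average chaining distance to the mean average chaining distance of its neighbors (see the paper for the precise SBN-path construction behind ac-dist):

$$ \mathrm{COF}_k(p) = \frac{\operatorname{ac\text{-}dist}(p)}{\frac{1}{k}\sum_{o \in N_k(p)} \operatorname{ac\text{-}dist}(o)} $$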
A vector of COF scores for observations. The greater the COF, the greater the outlierness.
Jacob H. Madsen
Tang, J., Chen, Z., Fu, A. W. C., & Cheung, D. W. (2002). Enhancing Effectiveness of Outlier Detections for Low Density Patterns. In Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD). Taipei. pp. 535-548. DOI: 10.1007/3-540-47887-6_53
# Create dataset
X <- iris[,1:4]

# Find outliers by setting an optional k
outlier_score <- COF(dataset=X, k=10)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate how many observations fall within a neighborhood of a given size, as an outlier score. Outliers are classified according to a user-given threshold for the fraction of observations that must fall within the neighborhood. Suggested by Knorr, M., & Ng, R. T. (1997).
DB(dataset, d = 1, fraction = 0.05)
dataset: The dataset for which observations are classified as outliers/inliers.
d: The radius of the neighborhood.
fraction: The proportion of observations that has to fall within the neighborhood for an observation to be classified as an inlier. If the proportion of observations within the neighborhood is less than the given fraction, the observation is classified as an outlier.
DB computes a neighborhood for each observation given a radius (argument 'd') and returns the number of neighbors within that neighborhood. Observations are classified as inliers or outliers based on the proportion (argument 'fraction') of observations required to be within the neighborhood.
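As a rough illustration of this classification rule, a naive O(n^2) sketch in base R could look like the following. This is not the package's implementation, and db_sketch is a hypothetical name:

# Hedged sketch of the DB(d, fraction) idea, not the package code
db_sketch <- function(dataset, d = 1, fraction = 0.05) {
  D <- as.matrix(dist(dataset))          # pairwise Euclidean distances
  neighbors <- rowSums(D <= d) - 1       # neighbors within radius d, excluding self
  classification <- ifelse(neighbors / nrow(D) < fraction, "Outlier", "Inlier")
  list(neighbors = neighbors, classification = classification)
}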
neighbors: The number of neighbors within the neighborhood.
classification: Binary classification of observations as inlier or outlier.
Jacob H. Madsen
Knorr, M., & Ng, R. T. (1997). A Unified Approach for Mining Outliers. In Conf. of the Centre for Advanced Studies on Collaborative Research (CASCON). Toronto, Canada. pp. 236-248. DOI: 10.1145/782010.782021
# Create dataset
X <- iris[,1:4]

# Classify observations
cls_observations <- DB(dataset=X, d=1, fraction=0.05)$classification

# Remove outliers from dataset
X <- X[cls_observations=='Inlier',]
Function to calculate the influenced outlierness as an outlier score for observations. Suggested by Jin, W., Tung, A. K. H., Han, J., & Wang, W. (2006)
INFLO(dataset, k = 5)
dataset: The dataset for which observations have an INFLO score returned.
k: The number of reverse k-nearest neighbors to compare density with. k has to be smaller than the number of observations in the dataset.
INFLO computes the influenced outlierness score for observations by comparing the density in the neighborhood of the observation being scored with the density in its reverse neighborhood. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package. The INFLO function is useful for outlier detection in clustering and other multidimensional domains.
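As a hedged restatement of Jin et al. (2006): with den(p) = 1/k-distance(p) and the influence space IS_k(p) being the union of p's k-nearest and reverse k-nearest neighbors, the score is the ratio of the average density in the influence space to p's own density:

$$ \mathrm{INFLO}_k(p) = \frac{\frac{1}{|IS_k(p)|}\sum_{o \in IS_k(p)} \mathrm{den}(o)}{\mathrm{den}(p)} $$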
A vector of INFLO scores for observations. The greater the INFLO, the greater the outlierness.
Jacob H. Madsen
Jin, W., Tung, A. K. H., Han, J., & Wang, W. (2006). Ranking Outliers Using Symmetric Neighborhood Relationship. In Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD). Singapore. pp. 577-593. DOI: 10.1007/11731139_68
# Create dataset
X <- iris[,1:4]

# Find outliers by setting an optional k
outlier_score <- INFLO(dataset=X, k=10)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate a density estimate as an outlier score for observations, over a range of k-nearest neighbors. Suggested by Schubert, E., Zimek, A. & Kriegel, H.-P. (2014).
KDEOS(dataset, k_min = 5, k_max = 10, eps = NULL)
dataset: The dataset for which observations have a KDEOS score returned.
k_min: The k parameter starting the k-range.
k_max: The k parameter ending the k-range. Has to be smaller than the number of observations in the dataset and greater than or equal to k_min.
eps: An optional minimum bandwidth. If eps is smaller than the mean reachability distance for observations, eps is used; otherwise the mean reachability distance is used as bandwidth.
KDEOS computes a kernel density estimate over a user-given range of k-nearest neighbors. The score is normalized between 0 and 1, such that an observation with a score of 1 has the lowest density estimate and the greatest outlierness. A Gaussian kernel is used for estimation, with the bandwidth being the reachability distance to neighboring observations. If a lower user-given bandwidth is desired, putting more weight on outlying observations, eps has to be lower than the mean reachability distance for observations. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package. The KDEOS function is useful for outlier detection in clustering and other multidimensional domains.
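For orientation, a Gaussian kernel density estimate over a k-neighborhood has the general form below; this is a hedged sketch with bandwidth h and data dimension d, while KDEOS's exact normalization and reachability-based bandwidth follow the paper:

$$ \hat{f}_k(p) = \frac{1}{k}\sum_{o \in N_k(p)} \frac{1}{(2\pi h^2)^{d/2}} \exp\!\left(-\frac{\lVert p - o \rVert^2}{2h^2}\right) $$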
A vector of KDEOS scores normalized between 0 and 1, with 1 being the greatest outlierness.
Jacob H. Madsen
Schubert, E., Zimek, A. & Kriegel, H.-P. (2014). Generalized Outlier Detection with Flexible Kernel Density Estimates. Proceedings of the 2014 SIAM International Conference on Data Mining. Philadelphia, USA. pp. 542-550. DOI: 10.1137/1.9781611973440.63
# Create dataset
X <- iris[,1:4]

# Find outliers by setting an optional range of k's
outlier_score <- KDEOS(dataset=X, k_min=10, k_max=15)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate the aggregated distance to the k-nearest neighbors over a range of k's, as an outlier score. Suggested by Angiulli, F., & Pizzuti, C. (2002).
KNN_AGG(dataset, k_min = 5, k_max = 10)
dataset: The dataset for which observations have an aggregated k-nearest neighbors distance returned.
k_min: The k parameter starting the k-range.
k_max: The k parameter ending the k-range. Has to be smaller than the number of observations in the dataset and greater than or equal to k_min.
KNN_AGG computes the aggregated distance to neighboring observations by aggregating the results from k_min-NN to k_max-NN, such that if k_min=1 and k_max=3, the results from 1NN, 2NN and 3NN are aggregated. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package. The KNN_AGG function is useful for outlier detection in clustering and other multidimensional domains.
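A minimal sketch of this aggregation on top of dbscan::kNN, assuming "results" means the summed kNN distances per k; knn_agg_sketch is a hypothetical name, not the package code:

library(dbscan)

# Hedged sketch: aggregate summed kNN distances over k_min..k_max
knn_agg_sketch <- function(dataset, k_min = 5, k_max = 10) {
  d <- kNN(as.matrix(dataset), k = k_max)$dist   # distances to the k_max nearest neighbors
  agg <- sapply(k_min:k_max, function(k) rowSums(d[, 1:k, drop = FALSE]))
  rowSums(agg)                                   # aggregate across the k-range
}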
A vector of aggregated distances for observations. The greater the distance, the greater the outlierness.
Jacob H. Madsen
Angiulli, F., & Pizzuti, C. (2002). Fast Outlier Detection in High Dimensional Spaces. In European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD). Helsinki, Finland. pp. 15-26. DOI: 10.1007/3-540-45681-3_2
# Create dataset
X <- iris[,1:4]

# Find outliers by setting a range of k's
outlier_score <- KNN_AGG(dataset=X, k_min=10, k_max=15)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate the in-degree as an outlier score for observations, given a k-nearest neighbors graph. Suggested by Hautamaki, V., & Karkkainen, I. (2004).
KNN_IN(dataset, k = 5)
dataset: The dataset for which observations have an in-degree returned.
k: The number of k-nearest neighbors to construct a graph with. Has to be smaller than the number of observations in the dataset.
KNN_IN computes the in-degree of each observation, being its number of reverse neighbors, in a k-nearest neighbors graph. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package. The KNN_IN function is useful for outlier detection in clustering and other multidimensional domains.
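A compact sketch of the in-degree computation with dbscan::kNN; knn_in_sketch is a hypothetical name, not the package code:

library(dbscan)

# Hedged sketch: in-degree of each observation in the kNN graph
knn_in_sketch <- function(dataset, k = 5) {
  id <- kNN(as.matrix(dataset), k = k)$id        # n x k matrix of neighbor indices
  tabulate(as.vector(id), nbins = nrow(dataset)) # times each point appears as a neighbor
}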
A vector of in-degrees for observations. The smaller the in-degree, the greater the outlierness.
Jacob H. Madsen
Hautamaki, V., Karkkainen, I. & Franti, P. (2004). Outlier Detection Using k-Nearest Neighbour Graph. In International Conference on Pattern Recognition. Cambridge, UK. pp. 430-433. DOI: 10.1109/ICPR.2004.1334558
# Create dataset
X <- iris[,1:4]

# Find outliers by setting an optional k
outlier_score <- KNN_IN(dataset=X, k=10)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = FALSE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate the sum of distances to the k-nearest neighbors as an outlier score, based on a kd-tree.
KNN_SUM(dataset, k=5)
dataset: The dataset for which observations have a summed k-nearest neighbors distance returned.
k: The number of k-nearest neighbors. k has to be smaller than the number of observations in the dataset.
KNN_SUM computes the sum of distances to neighboring observations. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package. The KNN_SUM function is useful for outlier detection in clustering and other multidimensional domains.
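Under the obvious reading, the core computation reduces to one line on top of dbscan::kNN; knn_sum_sketch is a hypothetical name, not the package code:

library(dbscan)

# Hedged sketch: summed distance to the k nearest neighbors
knn_sum_sketch <- function(dataset, k = 5) {
  rowSums(kNN(as.matrix(dataset), k = k)$dist)
}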
A vector of summed distances for observations. The greater the distance, the greater the outlierness.
Jacob H. Madsen
# Create dataset and set an optional k
X <- iris[,1:4]
K <- 5

# Find outliers
outlier_score <- KNN_SUM(dataset=X, k=K)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate a Local Density Estimate (LDE) and Local Density Factor (LDF) as outlier scores, using a Gaussian kernel. Suggested by Latecki, L., Lazarevic, A. & Pokrajac, D. (2007).
LDF(dataset, k = 5, h = 1, c = 1)
dataset: The dataset for which observations have an LDE and LDF score returned.
k: The number of k-nearest neighbors to compare density estimation with. k has to be smaller than the number of observations in the dataset.
h: User-given bandwidth for the kernel functions. The greater the bandwidth, the smoother the kernels and the less weight is put on outliers. Default is 1.
c: Scaling constant for the comparison of LDE to neighboring observations. LDF compares the average LDE of an observation's neighbors to its own LDE. Thus, c=1 results in an LDF between 0 and 1, while c=0 can result in very large or infinite values of LDF. Default is 1.
LDF computes a kernel density estimate, called LDE, over a user-given number of k-nearest neighbors. The LDF score compares the Local Density Estimate (LDE) of an observation to that of its neighboring observations. Naturally, if an observation has a greater LDE than its neighboring observations, it has no outlierness, whereas an observation with a smaller LDE than its neighboring observations has great outlierness. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package. The LDF function is useful for outlier detection in clustering and other multidimensional domains.
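A hedged restatement of the factor in Latecki et al. (2007), which also shows why c=1 bounds LDF to [0,1): writing the mean estimate over the k-neighborhood as an average of neighbor LDEs,

$$ \mathrm{LDF}_k(p) = \frac{\overline{\mathrm{LDE}}}{\mathrm{LDE}(p) + c\,\overline{\mathrm{LDE}}}, \qquad \overline{\mathrm{LDE}} = \frac{1}{k}\sum_{o \in N_k(p)} \mathrm{LDE}(o) $$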
LDE: A vector of Local Density Estimates for observations. The greater the LDE, the greater the centrality.
LDF: A vector of Local Density Factors for observations. The greater the LDF, the greater the outlierness.
Jacob H. Madsen
Latecki, L., Lazarevic, A. & Pokrajac, D. (2007). Outlier Detection with Kernel Density Functions. International Workshop on Machine Learning and Data Mining in Pattern Recognition: Machine Learning and Data Mining in Pattern Recognition. pp. 61-75. DOI: 10.1007/978-3-540-73499-4_6
# Create dataset
X <- iris[,1:4]

# Find outliers by setting an optional k
outlier_score <- LDF(dataset=X, k=10, h=2, c=1)$LDF

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate Local Distance-based Outlier Factor (LDOF) as an outlier score for observations. Suggested by Zhang, K., Hutter, M. & Jin, H. (2009)
LDOF(dataset, k = 5)
dataset: The dataset for which observations have an LDOF score returned.
k: The number of nearest neighbors to compare distances with.
LDOF computes the distance from an observation to its k-nearest neighbors and compares it with the average pairwise distance among those neighbors. The LDOF function is useful for outlier detection in clustering and other multidimensional domains.
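In the notation of Zhang et al. (2009), as a hedged restatement: with d̄_p the average distance from p to its k nearest neighbors and D̄_p the average pairwise distance inside that neighbor set,

$$ \mathrm{LDOF}_k(p) = \frac{\bar{d}_p}{\bar{D}_p}, \qquad \bar{d}_p = \frac{1}{k}\sum_{o \in N_k(p)} \mathrm{dist}(p, o), \qquad \bar{D}_p = \frac{1}{k(k-1)}\sum_{\substack{o, o' \in N_k(p) \\ o \neq o'}} \mathrm{dist}(o, o') $$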
A vector of LDOF scores for observations. The greater the LDOF score, the greater the outlierness.
Jacob H. Madsen
Zhang, K., Hutter, M. & Jin, H. (2009). A New Local Distance-based Outlier Detection Approach for Scattered Real-World Data. Pacific-Asia Conference on Knowledge Discovery and Data Mining: Advances in Knowledge Discovery and Data Mining. pp. 813-822. DOI: 10.1007/978-3-642-01307-2_84
# Create dataset
X <- iris[,1:4]

# Find outliers by setting an optional k
outlier_score <- LDOF(dataset=X, k=10)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate the Local Correlation Integral (LOCI) as an outlier score for observations. Suggested by Papadimitriou, S., Gibbons, P. B., & Faloutsos, C. (2003). Uses a number of nearest neighbors (nn) instead of a constant radius.
LOCI(dataset, alpha = 0.5, nn = 20, k = 3)
dataset: The dataset for which observations have a LOCI score returned.
alpha: Parameter setting the size of the sampling neighborhood as a proportion of the counting neighborhood. An alpha of 1 makes the sampling neighborhood the same size as the counting neighborhood (the distance to the nn-th neighbor); an alpha of 0.5 makes it half that size.
nn: The number of nearest neighbors defining the counting neighborhood. The original paper suggests a constant user-given radius that includes at least 20 neighbors, in order to avoid statistical errors in MDEF. Default is 20.
k: The number of standard deviations the sampling neighborhood of an observation must deviate from the sampling neighborhoods of neighboring observations to be classified as an outlier. Default is 3, as used in the original paper's experiments.
LOCI computes a counting neighborhood containing the nn nearest observations, with a radius equal to the distance to the outermost of these. Within the counting neighborhood, each observation has a sampling neighborhood whose size is determined by the alpha parameter. LOCI returns an outlier score based on the standard deviation of the sampling neighborhood counts, called the multi-granularity deviation factor (MDEF). The LOCI function is useful for outlier detection in clustering and other multidimensional domains.
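A hedged restatement of Papadimitriou et al. (2003): with n(p, αr) the count in p's sampling neighborhood and n̂(p, r, α) the average such count over the counting neighborhood,

$$ \mathrm{MDEF}(p, r, \alpha) = 1 - \frac{n(p, \alpha r)}{\hat{n}(p, r, \alpha)} $$

and p is flagged as an outlier when MDEF exceeds k times the normalized deviation of the counts (returned below as norm_MDEF).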
npar_pi: A vector of the number of observations within the sampling neighborhood for each observation.
avg_npar: A vector of the average number of observations within the sampling neighborhoods of neighboring observations.
sd_npar: A vector of standard deviations of the sampling neighborhood counts.
MDEF: A vector of multi-granularity deviation factors (MDEF) for observations. The greater the MDEF, the greater the outlierness.
norm_MDEF: A vector of normalized MDEF values, being sd_npar/avg_npar.
class: Classification of observations as inliers/outliers following the rule of k.
Jacob H. Madsen
Papadimitriou, S., Gibbons, P. B., & Faloutsos, C. (2003). LOCI: Fast Outlier Detection Using the Local Correlation Integral. In International Conference on Data Engineering. pp. 315-326. DOI: 10.1109/ICDE.2003.1260802
# Create dataset
X <- iris[,1:4]

# Classify observations
cls_observations <- LOCI(dataset=X, alpha=0.5, nn=20, k=1)$class

# Remove outliers from dataset
X <- X[cls_observations=='Inlier',]
Function to calculate the Local Outlier Factor (LOF) as an outlier score for observations. Suggested by Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000)
LOF(dataset, k = 5)
dataset: The dataset for which observations have an LOF score returned.
k: The number of k-nearest neighbors to compare density with. k has to be smaller than the number of observations in the dataset.
LOF computes a local density for each observation from a user-given number of k-nearest neighbors. The density is compared to the densities of the respective nearest neighbors, resulting in the local outlier factor. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package. The LOF function is useful for outlier detection in clustering and other multidimensional domains.
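For reference, the usual formulation from Breunig et al. (2000), restated here with hedging: with reach-dist_k(p, o) = max{k-dist(o), dist(p, o)}, the local reachability density and the factor are

$$ \mathrm{lrd}_k(p) = \left( \frac{1}{k} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}, \qquad \mathrm{LOF}_k(p) = \frac{1}{k} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)} $$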
A vector of LOF scores for observations. The greater the LOF, the greater the outlierness.
Jacob H. Madsen
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying Density-Based Local Outliers. In Int. Conf. On Management of Data. Dallas, TX. pp. 93-104. DOI: 10.1145/342009.335388
# Create dataset
X <- iris[,1:4]

# Find outliers by setting an optional k
outlier_score <- LOF(dataset=X, k=10)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate the Local Outlier Probability (LOOP) as an outlier score for observations. Suggested by Kriegel, H.-P., Kröger, P., Schubert, E., & Zimek, A. (2009)
LOOP(dataset, k = 5, lambda = 3)
dataset: The dataset for which observations have a LOOP score returned.
k: The number of k-nearest neighbors to compare density with.
lambda: Multiplication factor for the standard deviation. The greater the lambda, the smoother the results. Default is 3, as used in the original paper's experiments.
LOOP computes a local density based on the probabilistic set distance for observations, with a user-given number of k-nearest neighbors. The density is compared to the densities of the respective nearest neighbors, resulting in the local outlier probability. The values range from 0 to 1, with 1 indicating the greatest outlierness. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package. The LOOP function is useful for outlier detection in clustering and other multidimensional domains.
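A hedged restatement of Kriegel et al. (2009): with pdist(λ, p, S) being λ times the quadratic mean distance from p to the set S, the probabilistic local outlier factor and its error-function normalization read

$$ \mathrm{PLOF}_k(p) = \frac{\mathrm{pdist}(\lambda, p, N_k(p))}{\operatorname{E}_{o \in N_k(p)}\!\left[\mathrm{pdist}(\lambda, o, N_k(o))\right]} - 1, \qquad \mathrm{LoOP}_k(p) = \max\left\{0,\ \operatorname{erf}\!\left(\frac{\mathrm{PLOF}_k(p)}{\lambda\sqrt{2\,\operatorname{E}[\mathrm{PLOF}^2]}}\right)\right\} $$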
A vector of LOOP scores for observations. Scores close to 1 indicate outlierness, and scores close to 0 indicate inlierness.
Jacob H. Madsen
Kriegel, H.-P., Kröger, P., Schubert, E., & Zimek, A. (2009). LoOP: Local Outlier Probabilities. In ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China. pp. 1649-1652. DOI: 10.1145/1645953.1646195
# Create dataset
X <- iris[,1:4]

# Find outliers by setting an optional k
outlier_score <- LOOP(dataset=X, k=10, lambda=3)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to identify natural neighbors and the right k-parameter for kNN graphs, as suggested by Zhu, Q., Feng, J. & Huang, J. (2016).
NAN(dataset, NaN_Edges = FALSE)
dataset: The dataset for which natural neighbors are identified, along with a k-parameter.
NaN_Edges: Choice of whether to compute the edges between natural neighbors. Computationally heavy to compute.
NAN computes the natural neighbor eigenvalue and identifies natural neighbors in a dataset. The natural neighbor eigenvalue is powerful as a k-parameter for computing a k-nearest neighborhood, being suitable for outlier detection, clustering or predictive modelling. Natural neighbors are defined as two observations that are mutual k-nearest neighbors. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package.
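As a rough sketch of the search idea, under my simplified reading of Zhu et al. (2016): grow the neighborhood size r until every observation has at least one reverse neighbor. nan_r_sketch is a hypothetical name, and the package's stopping rule may differ:

library(dbscan)

# Hedged sketch: grow r until no observation lacks a reverse neighbor
nan_r_sketch <- function(dataset, r_max = 50) {
  X <- as.matrix(dataset)
  for (r in seq_len(r_max)) {
    id <- kNN(X, k = r)$id                        # n x r matrix of neighbor indices
    indeg <- tabulate(as.vector(id), nbins = nrow(X))
    if (all(indeg > 0)) return(r)                 # every point occurs as someone's neighbor
  }
  r_max
}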
NaN_Num: The number of in-degrees for observations, given r.
r: The natural neighbor eigenvalue. Useful as a k-parameter.
NaN_Edges: Matrix of edges between natural neighbors.
n_NaN: The number of natural neighbors.
Jacob H. Madsen
Zhu, Q., Feng, Ji. & Huang, J. (2016). Natural neighbor: A self-adaptive neighborhood method without parameter K. Pattern Recognition Letters. pp. 30-36. DOI: 10.1016/j.patrec.2016.05.007
# Select dataset
X <- iris[,1:4]

# Identify the right k-parameter
K <- NAN(X, NaN_Edges=FALSE)$r

# Use the k-setting in an arbitrary outlier detection algorithm
outlier_score <- LOF(dataset=X, k=K)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate the Natural Outlier Factor (NOF) as an outlier score for observations. Suggested by Huang, J., Zhu, Q., Yang, L. & Feng, J. (2015)
NOF(dataset)
dataset: The dataset for which observations have a NOF score returned.
NOF computes the nearest and reverse nearest neighborhoods for observations, based on the natural neighborhood algorithm. Density is compared between observations and their neighbors. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package.
nb: A vector of in-degrees for observations.
max_nb: The maximum in-degree in the nb vector. Used as the k-parameter in the outlier detection of NOF.
r: The natural neighbor eigenvalue.
NOF: A vector of Natural Outlier Factor scores. The greater the NOF, the greater the outlierness.
Jacob H. Madsen
Huang, J., Zhu, Q., Yang, L. & Feng, J. (2015). A non-parameter outlier detection algorithm based on Natural Neighbor. Knowledge-Based Systems. pp. 71-77. DOI: 10.1016/j.knosys.2015.10.014
# Select dataset
X <- iris[,1:4]

# Run NOF algorithm
outlier_score <- NOF(dataset=X)$NOF

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate the Relative Density-based Outlier Score (RDOS) as an outlier score for observations. Suggested by Tang, B. & He, H. (2017).
RDOS(dataset, k = 5, h = 1)
dataset: The dataset for which observations have an RDOS score returned.
k: The number of k-nearest neighbors used to identify reverse and shared nearest neighbors.
h: Bandwidth parameter for the Gaussian kernel. A small h puts more weight on outlying observations.
RDOS computes a kernel density estimate by combining the nearest, reverse nearest and shared neighbors into one neighborhood. The density estimate is compared to the density estimates of the neighborhood's observations. A Gaussian kernel is used for density estimation, given a user-chosen bandwidth. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package.
It is a computationally heavy task to identify reverse and shared neighbors from the kd-tree. Thus, RDOS has high complexity and is not recommended for datasets with n>5000. The RDOS function is useful for outlier detection in clustering and other multidimensional domains.
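A hedged restatement of the score in Tang & He (2017): with S_k(p) the combined neighborhood (nearest, reverse nearest and shared neighbors) and ρ the Gaussian-kernel density estimate, the score is the ratio of the neighborhood's mean density to the observation's own density:

$$ \mathrm{RDOS}_k(p) = \frac{\frac{1}{|S_k(p)|}\sum_{o \in S_k(p)} \rho(o)}{\rho(p)} $$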
A vector of RDOS scores for observations. The greater the RDOS score, the greater the outlierness.
Jacob H. Madsen
Tang, B. & He, H. (2017). A local density-based approach for outlier detection. Neurocomputing. pp. 171-180. DOI: 10.1016/j.neucom.2017.02.039
# Create dataset
X <- iris[,1:4]

# Find outliers by setting an optional k
outlier_score <- RDOS(dataset=X, k=10, h=2)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)
Function to calculate the RKOF score for observations, as suggested by Gao, J., Hu, W., Zhang, X. & Wu, O. (2011).
RKOF(dataset, k = 5, C = 1, alpha = 1, sigma2 = 1)
dataset: The dataset for which observations have an RKOF score returned.
k: The number of nearest neighbors to compare density estimation with.
C: Multiplication parameter for the k-distance of neighboring observations. Acts as a bandwidth multiplier. Default is 1, such that the k-distance is used for the Gaussian kernel.
alpha: Sensitivity parameter for the k-distance/bandwidth. A small alpha creates small variance in RKOF, and vice versa. Default is 1.
sigma2: Variance parameter for the weighting of neighboring observations.
RKOF computes a kernel density estimate for each observation and compares it to the weighted density estimates of neighboring observations. A Gaussian kernel is used for density estimation, with a bandwidth given by the k-distance. The k-distance can be influenced with the parameters C and alpha. A kd-tree is used for kNN computation, using the kNN() function from the 'dbscan' package. The RKOF function is useful for outlier detection in clustering and other multidimensional domains.
A vector of RKOF scores for observations. The greater the RKOF score, the greater the outlierness.
Jacob H. Madsen
Gao, J., Hu, W., Zhang, X. & Wu, O. (2011). RKOF: Robust Kernel-Based Local Outlier Detection. Pacific-Asia Conference on Knowledge Discovery and Data Mining: Advances in Knowledge Discovery and Data Mining. pp. 270-283. DOI: 10.1007/978-3-642-20847-8_23
# Create dataset
X <- iris[,1:4]

# Find outliers by setting an optional k
outlier_score <- RKOF(dataset=X, k = 10, C = 1, alpha = 1, sigma2 = 1)

# Sort and find index for most outlying observations
names(outlier_score) <- 1:nrow(X)
sort(outlier_score, decreasing = TRUE)

# Inspect the distribution of outlier scores
hist(outlier_score)