Title: | Interactively Filter SNP Datasets |
---|---|
Description: | Is designed to interactively and reproducibly visualize and filter SNP (single-nucleotide polymorphism) datasets. This R-based implementation of SNP and genotype filters facilitates an interactive and iterative SNP filtering pipeline, which can be documented reproducibly via Rmarkdown. 'SNPfiltR' contains functions for visualizing various quality and missing data metrics for a SNP dataset, and then filtering the dataset based on user specified cutoffs. All functions take 'vcfR' objects as input, which can easily be generated by reading standard vcf (variant call format) files into R using the R package 'vcfR' (Knaus and Grünwald) (<doi:10.1111/1755-0998.12549>). Each 'SNPfiltR' function can return a newly filtered vcfR object, which can then be written to a local directory in standard vcf format using the 'vcfR' package, for downstream population genetic and phylogenetic analyses. |
Authors: | Devon DeRaad [aut, cre] |
Maintainer: | Devon DeRaad <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.1 |
Built: | 2025-01-12 06:27:59 UTC |
Source: | https://github.com/cranhaven/cranhaven.r-universe.dev |
This function can be run in two ways: 1) Without 'thresholds' specified. This will run a PCA for the input vcf without filtering, and visualize the clustering of samples in two-dimensional space, coloring each sample according to a priori population assignment given in the popmap. 2) With 'thresholds' specified. This will filter your input vcf file to the specified missing data thresholds, and run a PCA for each filtering iteration. For each iteration, a 2D plot will be output showing clustering according to the specified popmap. This option is ideal for assessing the effects of missing data on clustering patterns.
assess_missing_data_pca( vcfR, popmap = NULL, thresholds = NULL, clustering = TRUE )
assess_missing_data_pca( vcfR, popmap = NULL, thresholds = NULL, clustering = TRUE )
vcfR |
a vcfR object |
popmap |
set of population assignments that will be used to color code the plots |
thresholds |
optionally specify a vector of missing data filtering thresholds to explore |
clustering |
use partitioning around medoids (PAM) to do unsupervised clustering on the output? (default = TRUE, max clusters = # of levels in popmap + 2) |
a series of plots showing the clustering of all samples in two-dimensional space
assess_missing_data_pca(vcfR = SNPfiltR::vcfR.example, popmap = SNPfiltR::popmap, thresholds = c(.6,.8))
assess_missing_data_pca(vcfR = SNPfiltR::vcfR.example, popmap = SNPfiltR::popmap, thresholds = c(.6,.8))
This function can be run in two ways: 1) Without 'thresholds' specified. This will run t-SNE for the input vcf without filtering, and visualize the clustering of samples in two-dimensional space, coloring each sample according to a priori population assignment given in the popmap. 2) With 'thresholds' specified. This will filter your input vcf file to the specified missing data thresholds, and run a t-SNE clustering analysis for each filtering iteration. For each iteration, a 2D plot will be output showing clustering according to the specified popmap. This option is ideal for assessing the effects of missing data on clustering patterns.
assess_missing_data_tsne( vcfR, popmap = NULL, thresholds = NULL, perplexity = NULL, iterations = NULL, initial_dims = NULL, clustering = TRUE )
assess_missing_data_tsne( vcfR, popmap = NULL, thresholds = NULL, perplexity = NULL, iterations = NULL, initial_dims = NULL, clustering = TRUE )
vcfR |
a vcfR object |
popmap |
set of population assignments that will be used to color code the plots |
thresholds |
a vector specifying the missing data filtering thresholds to explore |
perplexity |
numerical value specifying the perplexity paramter during t-SNE (default: 5) |
iterations |
a numerical value specifying the number of iterations for t-SNE (default: 1000) |
initial_dims |
a numerical value specifying the number of initial_dimensions for t-SNE (default: 5) |
clustering |
use partitioning around medoids (PAM) to do unsupervised clustering on the output? (default = TRUE, max clusters = # of levels in popmap + 2) |
a series of plots showing the clustering of all samples in two-dimensional space
assess_missing_data_tsne(vcfR = SNPfiltR::vcfR.example, popmap = SNPfiltR::popmap, thresholds = .8)
assess_missing_data_tsne(vcfR = SNPfiltR::vcfR.example, popmap = SNPfiltR::popmap, thresholds = .8)
This function requires a vcfR object as input, and returns a vcfR object filtered to retain only SNPs greater than a specified distance apart on each scaffold. The function starts by automatically retaining the first SNP on a given scaffold, and then subsequently keeping the next SNP that is greater than the specified distance away, until it reaches the end of the scaffold/chromosome. This function scales well with an increasing number of SNPs, but poorly with an increasing number of scaffolds/chromosomes. For this reason, there is a built in progress bar, to monitor potentially long-running executions with many scaffolds. This type of filtering is often employed to reduce linkage among input SNPs, especially for downstream input to programs like structure, which require unlinked SNPs.
distance_thin(vcfR, min.distance = NULL)
distance_thin(vcfR, min.distance = NULL)
vcfR |
a vcfR object |
min.distance |
a numeric value representing the smallest distance (in base-pairs) allowed between SNPs after distance thinning |
An identical vcfR object, except that SNPs separated by less than the specified distance have been removed from the file
distance_thin(vcfR = SNPfiltR::vcfR.example, min.distance = 1000)
distance_thin(vcfR = SNPfiltR::vcfR.example, min.distance = 1000)
This function requires a vcfR object as input, and returns a vcfR object filtered to convert heterozygous sites with an allele balance falling outside of the specified ratio to 'NA'. If no ratio is specified, a default .25-.75 limit is imposed. From the dDocent filtering page "Allele balance: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous, we expect that the allele balance in our data (for real loci) should be close to 0.5".
filter_allele_balance(vcfR, min.ratio = NULL, max.ratio = NULL)
filter_allele_balance(vcfR, min.ratio = NULL, max.ratio = NULL)
vcfR |
a vcfR object |
min.ratio |
minumum allele ratio for a called het |
max.ratio |
maximum allele ratio for a called het |
An identical vcfR object, except that genotypes failing the allele balance filter have been converted to 'NA'.
filter_allele_balance(vcfR = SNPfiltR::vcfR.example)
filter_allele_balance(vcfR = SNPfiltR::vcfR.example)
This function simply removes any SNPs from the vcf file which contains more than two alleles. Many downstream applications require SNPs to be biallelic, so this filter is generally a good idea during processing.
filter_biallelic(vcfR)
filter_biallelic(vcfR)
vcfR |
a vcfR object |
a vcfR object with SNPs containing more than two alleles removed
filter_biallelic(vcfR = SNPfiltR::vcfR.example)
filter_biallelic(vcfR = SNPfiltR::vcfR.example)
This function requires a vcfR object as input. The user can then specify the minimum value for depth of coverage required to retain a called genotype (must be numeric). Additionally, the user can specify a minimum genotype quality required to retain a called genotype (again, must be numeric).
hard_filter(vcfR, depth = NULL, gq = NULL)
hard_filter(vcfR, depth = NULL, gq = NULL)
vcfR |
a vcfR object |
depth |
an integer representing the minimum depth for genotype calls that you wish to retain (e.g. 'depth = 5' would remove all genotypes with a sequencing depth of 4 reads or less) |
gq |
an integer representing the minimum genotype quality for genotype calls that you wish to retain (e.g. 'gq = 30' would remove all genotypes with a quality score of 29 or lower) |
The vcfR object input, with the sites failing specified filters converted to 'NA'
hard_filter(vcfR = SNPfiltR::vcfR.example, depth = 5) hard_filter(vcfR = SNPfiltR::vcfR.example, depth = 5, gq = 30)
hard_filter(vcfR = SNPfiltR::vcfR.example, depth = 5) hard_filter(vcfR = SNPfiltR::vcfR.example, depth = 5, gq = 30)
This function can be run in two ways: 1) specify vcfR object only. This will visualize the distribution of mean depth per sample across all SNPs in your vcf file, and will not alter your vcf file. 2) specify vcfR object, and set 'maxdepth' = 'integer value'. This option will show you where your specified cutoff falls in the distribution of SNP depth, and remove all SNPs with a mean depth above the specified threshold from the vcf. Super high depth loci are likely multiple loci stuck together into a single paralogous locus. Note: This function filters on a 'per SNP' basis rather than a 'per genotype' basis, otherwise it would disproportionately remove genotypes from our deepest sequenced samples (because sequencing depth is so variable between samples).
max_depth(vcfR, maxdepth = NULL)
max_depth(vcfR, maxdepth = NULL)
vcfR |
a vcfR object |
maxdepth |
an integer specifying the maximum mean depth for a SNP to be retained |
The vcfR object input, with SNPs above the 'maxdepth' cutoff removed
max_depth(vcfR = SNPfiltR::vcfR.example) max_depth(vcfR = SNPfiltR::vcfR.example, maxdepth = 100)
max_depth(vcfR = SNPfiltR::vcfR.example) max_depth(vcfR = SNPfiltR::vcfR.example, maxdepth = 100)
This function can be run in two ways: 1) Without 'min.mac' specified. This will return a folded site frequency spectrum (SFS), without performing any filtering on the vcf file. Or 2) With 'min.mac' specified. This will also print the folded SFS and show you where your specified min. MAC count falls. It will then return your vcfR object with SNPs falling below your min. MAC threshold removed. Note: previous filtering steps (especially removing samples) may have resulted in invariant SNPs (MAC =0). For this reason it's a good idea to run min_mac(vcfR, min.mac=1) before using a SNP dataset in downstream analyses.
min_mac(vcfR, min.mac = NULL)
min_mac(vcfR, min.mac = NULL)
vcfR |
a vcfR object |
min.mac |
an integer specifying the minimum minor allele count for a SNP to be retained (e.g. 'min.mac=3' would remove all SNPs with a MAC of 2 or less) |
if 'min.mac' is not specified, the allele frequency spectrum is returned. If 'min.mac' is specified, SNPs falling below the MAC cutoff will be removed, and the filtered vcfR object will be returned.
min_mac(vcfR=SNPfiltR::vcfR.example)
min_mac(vcfR=SNPfiltR::vcfR.example)
This function can be run in two ways: 1) Without 'cutoff' specified. This will vizualise the amount of missing data in each sample across a variety of potential missing data cutoffs. Additionally, it will show you a dotplot ordering the amount of overall missing data in each sample. Based on these visualizations, you can make an informed decision on what you think might be an optimal cutoff to remove samples that are missing too much data to be retained for downstream analyses. 2) with 'cutoff' specified. This option will show you the dotplot with the cutoff you set, and then remove samples above the missing data cutoff you set, and return the filtered vcf to you.
missing_by_sample(vcfR, popmap = NULL, cutoff = NULL)
missing_by_sample(vcfR, popmap = NULL, cutoff = NULL)
vcfR |
a vcfR object |
popmap |
if specifies, it must be a two column dataframe with columns names 'id' and 'pop'. IDs must match the IDs in the vcfR object |
cutoff |
a numeric value between 0-1 specifying the maximum proportion of missing data allowed in a sample to be retained for downstream analyses |
Note: This decision is highly project specific, but these visualizations should help you get a feel for how very low data samples cannot be rescued simply by a missing data SNP filter. If you want to remove specific samples from your vcf that cannot be specified with a simple cutoff refer to this great tutorial which is the basis for the code underlying this function.
if 'cutoff' is not specified, will return a dataframe containing the average depth and proportion missing data in each sample. If 'cutoff' is specified, the samples falling above the missing data cutoff will be removed, and the filtered vcfR object will be returned.
missing_by_sample(vcfR = SNPfiltR::vcfR.example) missing_by_sample(vcfR = SNPfiltR::vcfR.example, cutoff = .7)
missing_by_sample(vcfR = SNPfiltR::vcfR.example) missing_by_sample(vcfR = SNPfiltR::vcfR.example, cutoff = .7)
This function can be run in two ways: 1) Without 'cutoff' specified. This will vizualise the amount of missing data in each sample across a variety of potential missing data cutoffs. Additionally, it will show you dotplots visualizing the number of total SNPs retained across a variety of filtering cutoffs, and the total proportion of missing data. Based on these visualizations, you can make an informed decision on what you think might be an optimal cutoff to minimize the overall missingness of your dataset while still retaining an appropriate amount of SNPs for the downstream inferences you hope to make 2) with 'cutoff' specified. This option will show you the dotplots with the cutoff you set, and then remove SNPs above the missing data cutoff.
missing_by_snp(vcfR, cutoff = NULL)
missing_by_snp(vcfR, cutoff = NULL)
vcfR |
a vcfR object |
cutoff |
a numeric value between 0-1 specifying the maximum proportion of missing data allowed in a SNP to be retained for downstream analyses |
if 'cutoff' is not specified, will return a dataframe containing the proportion missing data and the total SNPs retained across each filtering level. If 'cutoff' is specified, SNPs falling above the missing data cutoff will be removed, and the filtered vcfR object will be returned.
missing_by_snp(vcfR = SNPfiltR::vcfR.example) missing_by_snp(vcfR = SNPfiltR::vcfR.example, cutoff = .6)
missing_by_snp(vcfR = SNPfiltR::vcfR.example) missing_by_snp(vcfR = SNPfiltR::vcfR.example, cutoff = .6)
A dataset containing the sample name and population assignment for the 20 scrub-jay samples in SNPfilR::vcfR.example . The variables are as follows:
data(popmap)
data(popmap)
A data frame with 20 rows and 2 variables
id. unique sample identifier
pop. population assignment for each individual
The SNPfiltR package allows users to interactively visualize the effects of relevant filters on their datasets in order to optimize filtering parameters rather than simply following historical precedent. Each function takes a variant call format (vcf) file, stored in local memory as a vcfR object, as input. Most functions can be run without specified cutoffs, in order to visualize the distribution of the parameter of interest in your given dataset. Then the same function can be run with a specified cutoff, and a filtered vcfR object will be returned. For detailed documentation and vignettes showing fully implemented SNP filtering pipelines, please go to: devonderaad.github.io/SNPfiltR
A vcfR object containing 500 SNPs for 20 individuals. Species assignments for each individual can be accessed via SNPfiltR::popmap
data(vcfR.example)
data(vcfR.example)
A vcfR object containing 500 SNPs for 20 individuals