| Title: | Analyze and Integrate Any Type of Nucleotide Recoding RNA-Seq Data |
|---|---|
| Description: | A complete rewrite and reimagining of 'bakR' (see 'Vock et al.' (2025) <doi:10.1371/journal.pcbi.1013179>). Designed to support a wide array of analyses of nucleotide recoding RNA-seq (NR-seq) datasets of any type, including TimeLapse-seq/SLAM-seq/TUC-seq, Start-TimeLapse-seq (STL-seq), TT-TimeLapse-seq (TT-TL-seq), and subcellular NR-seq. 'EZbakR' extends standard NR-seq standard NR-seq mutational modeling to support multi-label analyses (e.g., 4sU and 6sG dual labeling), and implements an improved hierarchical model to better account for transcript-to-transcript variance in metabolic label incorporation. 'EZbakR' also generalized dynamical systems modeling of NR-seq data to support analyses of premature mRNA processing and flow between subcellular compartments. Finally, 'EZbakR' implements flexible and well-powered comparative analyses of all estimated parameters via design matrix-specified generalized linear modeling. |
| Authors: | Isaac Vock [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-7178-6886>) |
| Maintainer: | Isaac Vock <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.0 |
| Built: | 2026-05-25 09:34:48 UTC |
| Source: | https://github.com/isaacvock/ezbakr |
AverageAndRegularize fits a generalized linear model to your data to effectively
average parameter estimates across replicates and get overall uncertainty estimates
for those parameters. The linear model to which your data is fit is specified via
an R formula object supplied to the formula_mean parameter. Uncertainty estimates
are regularized via a hierarchical modeling strategy originally introduced with
bakR, though slightly improved upon since then.
AverageAndRegularize( obj, features = NULL, parameter = "log_kdeg", type = "kinetics", kstrat = NULL, populations = NULL, fraction_design = NULL, exactMatch = TRUE, repeatID = NULL, formula_mean = NULL, sd_grouping_factors = NULL, include_all_parameters = TRUE, sd_reg_factor = 10, error_if_singular = TRUE, min_reads = 10, convert_tl_to_factor = TRUE, regress_se_with_abs = TRUE, force_lm = FALSE, force_optim = force_lm, conservative = FALSE, character_limit = 20, feature_lengths = NULL, feature_sample_counts = NULL, scale_factor_df = NULL, overwrite = TRUE )AverageAndRegularize( obj, features = NULL, parameter = "log_kdeg", type = "kinetics", kstrat = NULL, populations = NULL, fraction_design = NULL, exactMatch = TRUE, repeatID = NULL, formula_mean = NULL, sd_grouping_factors = NULL, include_all_parameters = TRUE, sd_reg_factor = 10, error_if_singular = TRUE, min_reads = 10, convert_tl_to_factor = TRUE, regress_se_with_abs = TRUE, force_lm = FALSE, force_optim = force_lm, conservative = FALSE, character_limit = 20, feature_lengths = NULL, feature_sample_counts = NULL, scale_factor_df = NULL, overwrite = TRUE )
obj |
An |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
parameter |
Parameter to average across replicates of a given condition. |
type |
What type of table is the parameter found in? Default is "kinetics", but can also set to "fractions". |
kstrat |
If |
populations |
Character vector of the set of mutational populations that you want to infer the fractions of. Only relevant if type == "fractions". |
fraction_design |
"Design matrix" specifying which RNA populations exist in your samples. Only relevant if type == "fractions". |
exactMatch |
If TRUE, then |
repeatID |
If multiple |
formula_mean |
An R formula object specifying how the NOTE: EZbakR automatically removes any intercept terms from the model. That way, there is no ambiguity about what parameter is defined as the reference. |
sd_grouping_factors |
What metadf columns should data be grouped by when estimating
standard deviations across replicates? If this is NULL, then EZbakR will check to see
if the |
include_all_parameters |
If TRUE, an additional table will be saved with the prefix |
sd_reg_factor |
Determines how strongly variance estimates are shrunk towards trend. Higher numbers lead to more regularization. Eventually, this will be replaced with estimation of how much variance there seems to be in the population of variances. |
error_if_singular |
If TRUE, linear model will throw an error if parameters cannot be uniquely identified. This is most often caused by parameters that cannot be estimated from the data, e.g., due to limited replicate numbers or correlated sample characteristics (i.e., all treatment As also correspond to batch As, and all treatment Bs correspond to batch Bs). |
min_reads |
Minimum number of reads in all samples for a feature to be kept. |
convert_tl_to_factor |
If a label time variable is included in the |
regress_se_with_abs |
If TRUE, and if |
force_lm |
Certain formula lend them selves to efficient approximations of the
full call to |
force_optim |
Old parameter that is now passed the value |
conservative |
If TRUE, conservative variance regularation will be performed. In this case, variances below the trend will be regularized up to the trend, and variances above the trend will be left unregularized. This avoids underestimation of variances. |
character_limit |
Limit on the number of characters of the name given to the output table. Will attempt to concatenate the parameter name with the names of all of the features. If this is too long, only the parameter name will be used. |
feature_lengths |
Table of effective lengths for each feature combination in your data. For example, if your analysis includes features named GF and XF, this should be a data frame with columns GF, XF, and length. |
feature_sample_counts |
Data frame with columns |
scale_factor_df |
Data frame with columns "sample" and a second column of whatever name you please. The second column should denote scale factors by which read counts in that sample should be multiplied by in order to normalize these read counts. |
overwrite |
If TRUE, identical, existing output will be overwritten. |
The EZbakR website has an extensive vignette walking through various use cases
and model types that you can fit with AverageAndRegularize(): vignette link.
EZbakR improves upon bakR by balancing extra conservativeness in several steps
with a more highly powered statistical testing scheme in its CompareParameters()
function. In particular, the following changes to the variance regularization
scheme were made:
Sample-specific parameter uncertainties are used to generate conservative estimates of feature-specific replicate variabilties. In addition, a small floor is set to ensure that replicate variance estimates are never below a certain level, for the same reason.
Condition-wide replicate variabilities are regressed against both read coverage and either a) |logit(estimate)| when modeling average fraction labeled. This captures the fact thta estimates are best around a logit(fraction labeled) of 0 and get worse for more extreme fraction labeled's.; b) log(kdeg) when modeling log degradation rate constants. At first, I considered a strategy similar to the fraction labeled modeling, but found that agreement between a fully rigorous MCMC sampling approach and EZbakR was significantly improved by just regressing hee value of the log kientic parameter, likely due to the non-linear transformation of fraction labeled to log(kdeg); and c) only coverage in all other cases.
Features with replicate variabilities below the inferred trend have their replicate variabilites set equal to that predicted by the trend. This helps limit underestimation of parameter variance. Features with above-trend replicate variabilties have their replicate variabilities regularized with a Normal prior Normal likelihood Bayesian model, as in bakR (so the log(variance) is the inferred mean of this distribution, and the known variance is inferred from the amount of variance about the linear dataset-wide trend).
All of this allows CompareParameters() to use a less conservative statistical test
when calculating p-values, while still controlling false discovery rates.
EZbakRData object with an additional "averages" table, as well as a
fullfit table under the same list heading, which includes extra information about
the priors used for regularization purposes.
# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo) # Average estimates across replicate ezbdo <- AverageAndRegularize(ezbdo)# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo) # Average estimates across replicate ezbdo <- AverageAndRegularize(ezbdo)
CompareParameters() calculates differences in parameters estimated by
AverageAndRegularize() or EZDynamics() and performs null hypothesis
statistical testing, comparing their values to a null hypothesis of 0.
CompareParameters( obj, design_factor, reference, experimental, param_name, parameter = "log_kdeg", type = "averages", param_function, condition = NULL, features = NULL, exactMatch = TRUE, repeatID = NULL, formula_mean = NULL, sd_grouping_factors = NULL, fit_params = NULL, normalize_by_median = FALSE, reference_levels = NULL, experimental_levels = NULL, overwrite = TRUE )CompareParameters( obj, design_factor, reference, experimental, param_name, parameter = "log_kdeg", type = "averages", param_function, condition = NULL, features = NULL, exactMatch = TRUE, repeatID = NULL, formula_mean = NULL, sd_grouping_factors = NULL, fit_params = NULL, normalize_by_median = FALSE, reference_levels = NULL, experimental_levels = NULL, overwrite = TRUE )
obj |
An |
design_factor |
Name of factor from |
reference |
Name of reference |
experimental |
Name of |
param_name |
If you want to assess the significance of a single parameter, rather than the comparison of two parameters, specify that one parameter's name here. |
parameter |
Parameter to average across replicates of a given condition. |
type |
Type of table to use. Can either be "averages" or "dynamics". |
param_function |
NOT YET IMPLEMENTED. Will allow you to specify more complicated functions of parameters when hypotheses you need to test are combinations of parameters rather than individual parameters or simple differences in two parameters. |
condition |
Same as |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
exactMatch |
If TRUE, then |
repeatID |
If multiple |
formula_mean |
An R formula object specifying how the |
sd_grouping_factors |
Metadf columns should data was grouped by when estimating standard deviations across replicates for the averages object you want to use. |
fit_params |
Character vector of parameter names in the averages object you would like to use. |
normalize_by_median |
If TRUE, then median difference will be set equal to 0. This can account for global biases in parameter estimates due to things like differences in effective label times. Does risk eliminating real signal though, so user discretion is advised. |
reference_levels |
Same as |
experimental_levels |
Same as |
overwrite |
If TRUE, then identical output will be overwritten if it exists. |
The EZbakR website has an extensive vignette walking through various use cases
and parameters you can compare with CompareParameters(): vignette link.
There are essentially 3 scenarios that CompareParameters() can handle:
Pairwise comparisons: compare reference to experimental parameter
estimates of a specified design_factor from AverageAndRegularize(). log(experimental / reference) is
the computed "difference" in this case.
Assess the value of a single parameter, which itself should represent a
difference between other parameters. The name of this parameter can be specified
via the param_name argument. This is useful for various interaction models
where some of the parameters of these models may represent things like "effect of
A on condition X".
Pairwise comparison of dynamical systems model parameter estimate: similar
to the first case listed above, but now when type == "dynamics". design_factor can
now be a vector of all the metadf columns you stratified parameter estimates by.
Eventually, a 4th option via the currently non-functional param_function argument
will be implemented, which will allow you to specify functions of parameters to be assessed,
which can be useful for certain interaction models.
CompareParameters() calculates p-values using what is essentially an asymptotic Wald test,
meaning that a standard normal distribution is integrated. P-values are then multiple-test
adjusted using the method of Benjamini and Hochberg, implemented in R's p.adjust()
function.
EZbakRData object with an additional "comparisons" table, detailing
the result of a comparison of a parameter estimate's valules across two different
conditions.
# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo) # Average estimates across replicate ezbdo <- AverageAndRegularize(ezbdo) # Compare parameters across conditions ezbdo <- CompareParameters( ezbdo, design_factor = "treatment", reference = "treatment1", experimental = "treatment2" )# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo) # Average estimates across replicate ezbdo <- AverageAndRegularize(ezbdo) # Compare parameters across conditions ezbdo <- CompareParameters( ezbdo, design_factor = "treatment", reference = "treatment1", experimental = "treatment2" )
Uses the strategy described here, and similar to that originally presented in Berg et al. 2024.
CorrectDropout( obj, strategy = c("grandR", "bakR"), grouping_factors = NULL, features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, read_cutoff = 25, dropout_cutoff = 5, ... )CorrectDropout( obj, strategy = c("grandR", "bakR"), grouping_factors = NULL, features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, read_cutoff = 25, dropout_cutoff = 5, ... )
obj |
An EZbakRFractions object, which is an EZbakRData object on which
you have run |
strategy |
Which dropout correction strategy to use. Options are:
The "bakR" strategy has the advantage of being model-derived, making it possible to assess model fit and thus whether the simple assumptions of both the "bakR" and "grandR" dropout models are met. The "grandR" strategy has the advantage of being more robust. Thus, the "grandR" strategy is currently used by default. |
grouping_factors |
Which sample-detail columns in the metadf should be used
to group -s4U samples by for calculating the average -s4U RPM? The default value of
|
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
read_cutoff |
Minimum number of reads for a feature to be used to fit the dropout model. |
dropout_cutoff |
Maximum ratio of -s4U:+s4U RPMs for a feature to be used to fit the dropout model (i.e., simple outlier filtering cutoff). |
... |
Parameters passed to internal |
Dropout is the disproportionate loss of labeled RNA/reads from said RNA
described independently here
and here. It can originate from a combination of
bioinformatic (loss of high mutation content reads due to alignment problems),
technical (loss of labeled RNA during RNA extraction), and biological (transcriptional
shutoff in rare cases caused by metabolic label toxicity) sources.
CorrectDropout() compares label-fed and label-free controls from the same
experimental conditions to estimate and correct for this dropout. It assumes
that there is a single number (referred to as the dropout rate, or pdo) which
describes the rate at which labeled RNA is lost (relative to unlabeled RNA).
pdo ranges from 0 (no dropout) to 1 (complete loss of all labeled RNA), and
is thus interpreted as the percentage of labeled RNA/reads from labeled RNA
disproportionately lost, relative to the equivalent unlabeled species.
An EZbakRData object with the specified "fractions" table replaced
with a dropout corrected table.
# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Correct for dropout ezbdo <- CorrectDropout(ezbdo)# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Correct for dropout ezbdo <- CorrectDropout(ezbdo)
fraction_design table for EstimateFractions.A fraction_design table denotes what populations of labeled/unlabeled RNA are present in your data.
A fraction_design table as one column for each mutation type (e.g., TC) present
in your cB file, and one column named "present". Each entry is either TRUE or
FALSE. The rows include all possible combinations of TRUE and FALSE for all
mutation types columns. A value of TRUE in a mutation type column represents
a population of reads that have high amounts (on average) of that mutation type.
For example, if your fraction_design table has mutation type columns "TC" and
"GA", the row with TC == TRUE and GA == FALSE represents a population of reads
with high T-to-C mutation content and low G-to-A mutation content. In other words,
these are reads from RNA synthesized in the presence of s4U but not s6G. If such
a population exists in your data, the "present" column for that row should have a
value of TRUE.
create_fraction_design(mutrate_populations)create_fraction_design(mutrate_populations)
mutrate_populations |
Character vector of the set of mutational populations
present in your data. For example, s4U fed data with standard nucleotide recoding
chemistry (e.g., TimeLapse, SLAM, TUC, AMUC, etc.) would have a |
A fraction_design table that assumes that every possible combination of
mutational populations listed in mutrate_populations are present in your data.
The present column can be modified if this assumption is incorrect. This default
is chosen as it will in theory work for all analyses, it may just be unnecessarily
inefficient and estimate the abundance of populations that don't exist.
# Standard, single-label NR-seq fd <- create_fraction_design(c("TC")) # Dual-label NR-seq fd2 <- create_fraction_design(c("TC", "GA")) # Adjust dual-label output for TILAC fd2$present <- ifelse(fd2$TC & fd2$GA, FALSE, fd2$present)# Standard, single-label NR-seq fd <- create_fraction_design(c("TC")) # Dual-label NR-seq fd2 <- create_fraction_design(c("TC", "GA")) # Adjust dual-label output for TILAC fd2$present <- ifelse(fd2$TC & fd2$GA, FALSE, fd2$present)
Combines the output of EstimateFractions with feature
quantification performed by an outside tool (e.g., RSEM, kallisto, salmon, etc.)
to infer fraction news for features that reads cannot always be assigned to
unambiguously. This is a generalization of EstimateIsoformFractions
which performs this deconvolution for transcript isoforms.
DeconvolveFractions( obj, feature_type = c("gene", "isoform"), features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, fraction_name = NULL, quant_name = NULL, gene_to_transcript = NULL, overwrite = TRUE, TPM_min = 1, count_min = 10 )DeconvolveFractions( obj, feature_type = c("gene", "isoform"), features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, fraction_name = NULL, quant_name = NULL, gene_to_transcript = NULL, overwrite = TRUE, TPM_min = 1, count_min = 10 )
obj |
An |
feature_type |
Either |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
fraction_name |
Name of fraction estimate table to use. Should be stored in the
|
quant_name |
Name of transcript isoform quantification table to use. Should be stored
in the obj$readcounts list under this name. Use |
gene_to_transcript |
Table with columns |
overwrite |
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs. |
TPM_min |
Minimum TPM for a transcript to be kept in analysis. |
count_min |
Minimum expected_count for a transcript to be kept in analysis. |
DeconvolveFractions expects as input a "fractions" table with estimates
for fraction news of at least one convolved feature set. A convolved feature
set is one where some reads cannot be unambiguously assigned to one instance
of that feature type. For example, it is often impossible to assign short
reads to a single transcript isoform. Thus, something like the "TEC" assignment
provided by fastq2EZbakR is an instance of a convolved feature set, as it
is an assignment of reads to transcript isoforms with which they are compatible.
Another example is assignment to the exonic regions of genes, for fusion genes
(where a read may be consistent with both the putative fusion gene as well
as one of the fusion components).
DeconvolveFractions deconvolves fraction news
by fitting a linear mixing model to the convolved fraction new estimates +
feature abundance estimates. In other words, each convolved fraction new (data) is modeled
as a weighted average of single feature fraction news (parameters to estimate),
with the weights determined by the relative abundances of the features
in the convolved set (data). The convolved fraction new is modeled as a beta distribution with mean
equal to the weighted feature fraction new average and variance related
to the number of reads in the convolved feature set.
Features with estimated TPMs less than TPM_min (1 by default) or an estimated number of read
counts less than count_min (10 by default) are removed from convolved feature sets and will
not have their fraction news estimated.
An EZbakRData object with an additional table under the "fractions"
list. Has the same form as the output of EstimateFractions(), and will have the
feature column "transcript_id".
# Load dependencies library(dplyr) # Simulates a single sample worth of data simdata_iso <- SimulateIsoforms(nfeatures = 30) # We have to manually create the metadf in this case metadf <- tibble(sample = 'sampleA', tl = 4, condition = 'A') ezbdo <- EZbakRData(simdata_iso$cB, metadf) ezbdo <- EstimateFractions(ezbdo) ### Hack in the true, simulated isoform levels reads <- simdata_iso$ground_truth %>% dplyr::select(transcript_id, true_count, true_TPM) %>% dplyr::mutate(sample = 'sampleA', effective_length = 10000) %>% dplyr::rename(expected_count = true_count, TPM = true_TPM) # Name of table needs to have "isoform_quant" in it ezbdo[['readcounts']][['simulated_isoform_quant']] <- reads ### Perform deconvolution ezbdo <- DeconvolveFractions(ezbdo, feature_type = "isoform")# Load dependencies library(dplyr) # Simulates a single sample worth of data simdata_iso <- SimulateIsoforms(nfeatures = 30) # We have to manually create the metadf in this case metadf <- tibble(sample = 'sampleA', tl = 4, condition = 'A') ezbdo <- EZbakRData(simdata_iso$cB, metadf) ezbdo <- EstimateFractions(ezbdo) ### Hack in the true, simulated isoform levels reads <- simdata_iso$ground_truth %>% dplyr::select(transcript_id, true_count, true_TPM) %>% dplyr::mutate(sample = 'sampleA', effective_length = 10000) %>% dplyr::rename(expected_count = true_count, TPM = true_TPM) # Name of table needs to have "isoform_quant" in it ezbdo[['readcounts']][['simulated_isoform_quant']] <- reads ### Perform deconvolution ezbdo <- DeconvolveFractions(ezbdo, feature_type = "isoform")
The first step of any NR-seq analysis is to figure out the fraction of reads
from each mutational population in your data. For example, if you are performing
a standard SLAM-seq or TimeLapse-seq experiment, this means estimating the fraction
of reads with high T-to-C mutation content, and the fraction with low T-to-C mutation
content. This is what EstimateFractions is for.
EstimateFractions( obj, features = "all", mutrate_populations = "all", fraction_design = NULL, Poisson = TRUE, strategy = c("standard", "hierarchical"), filter_cols = "all", filter_condition = `&`, remove_features = c("NA", "__no_feature"), split_multi_features = FALSE, multi_feature_cols = NULL, multi_feature_sep = "+", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, hier_readcutoff = 300, init_pnew_prior_sd = 0.8, pnew_prior_sd_min = 0.01, pnew_prior_sd_max = 0.15, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL, character_limit = 20, overwrite = TRUE ) ## S3 method for class 'EZbakRData' EstimateFractions( obj, features = "all", mutrate_populations = "all", fraction_design = NULL, Poisson = TRUE, strategy = c("standard", "hierarchical"), filter_cols = "all", filter_condition = `&`, remove_features = c("NA", "__no_feature"), split_multi_features = FALSE, multi_feature_cols = NULL, multi_feature_sep = "+", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, hier_readcutoff = 300, init_pnew_prior_sd = 0.3, pnew_prior_sd_min = 0.01, pnew_prior_sd_max = 0.15, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL, character_limit = 20, overwrite = TRUE ) ## S3 method for class 'EZbakRArrowData' EstimateFractions( obj, features = "all", mutrate_populations = "all", fraction_design = NULL, Poisson = TRUE, strategy = c("standard", "hierarchical"), filter_cols = "all", filter_condition = `&`, remove_features = c("NA", "__no_feature"), split_multi_features = FALSE, multi_feature_cols = NULL, multi_feature_sep = "+", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, hier_readcutoff = 300, init_pnew_prior_sd = 0.8, pnew_prior_sd_min = 0.01, pnew_prior_sd_max = 0.15, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL, character_limit = 20, overwrite = TRUE )EstimateFractions( obj, features = "all", mutrate_populations = "all", fraction_design = NULL, Poisson = TRUE, strategy = c("standard", "hierarchical"), filter_cols = "all", filter_condition = `&`, remove_features = c("NA", "__no_feature"), split_multi_features = FALSE, multi_feature_cols = NULL, multi_feature_sep = "+", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, hier_readcutoff = 300, init_pnew_prior_sd = 0.8, pnew_prior_sd_min = 0.01, pnew_prior_sd_max = 0.15, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL, character_limit = 20, overwrite = TRUE ) ## S3 method for class 'EZbakRData' EstimateFractions( obj, features = "all", mutrate_populations = "all", fraction_design = NULL, Poisson = TRUE, strategy = c("standard", "hierarchical"), filter_cols = "all", filter_condition = `&`, remove_features = c("NA", "__no_feature"), split_multi_features = FALSE, multi_feature_cols = NULL, multi_feature_sep = "+", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, hier_readcutoff = 300, init_pnew_prior_sd = 0.3, pnew_prior_sd_min = 0.01, pnew_prior_sd_max = 0.15, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL, character_limit = 20, overwrite = TRUE ) ## S3 method for class 'EZbakRArrowData' EstimateFractions( obj, features = "all", mutrate_populations = "all", fraction_design = NULL, Poisson = TRUE, strategy = c("standard", "hierarchical"), filter_cols = "all", filter_condition = `&`, remove_features = c("NA", "__no_feature"), split_multi_features = FALSE, multi_feature_cols = NULL, multi_feature_sep = "+", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, hier_readcutoff = 300, init_pnew_prior_sd = 0.8, pnew_prior_sd_min = 0.01, pnew_prior_sd_max = 0.15, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL, character_limit = 20, overwrite = TRUE )
obj |
|
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
mutrate_populations |
Character vector of the set of mutational populations that you want to infer the rates of mutations for. By default, all mutation rates are estimated for all populations present in cB. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the If you call the function
For example, assume you are doing a typical TimeLapse-seq/SLAM-seq/TUC-seq/etc. experiment
where you have fed cells with s^4U and recoded any incorporated s^4U to a nucleotide
that reverse transcriptase will read as a cytosine. That means that As another example, consider TILAC, a NR-seq extension developed by the Simon lab. TILAC was originally
described in Courvan et al., 2022. In this
method, two populations of RNA, one from s^4U fed cells and one from s^6G fed cells, are pooled
and prepped for sequencing together. This allows for internally controlled comparisons of RNA
abundance without spike-ins. s^4U is recoded to a cytosine analog by TimeLapse chemistry
(or similar chemistry) and s^6G is recoded to an adenine analog. Thus, |
Poisson |
If |
strategy |
String denoting which new read mutation rate estimation strategy to use. Options include:
|
filter_cols |
Which feature columns should be used to filter out feature-less reads. The default value of "all" checks all feature columns for whether or not a read failed to get assigned to said feature. |
filter_condition |
Only two possible values for this make sense: |
remove_features |
All of the feature names that could indicate failed assignment of a read to a given feature. the fastq2EZbakR pipeline uses a value of '__no_feature'. |
split_multi_features |
If a set of reads maps ambiguously to multiple features,
should data for such reads be copied for each feature in the ambiguous set? If this is
|
multi_feature_cols |
Character vector of columns that have the potential to
include assignment to multiple features. Only these columns will have their features split
if |
multi_feature_sep |
String representing how ambiguous feature assignments are distinguished in the feature names. For example, the default value of "+" denotes that if a read maps to multiple features (call them featureA and featureB, for example), then the feature column will have a value of "featureA+featureB" for that read. |
pnew_prior_mean |
Mean for logit(pnew) prior. |
pnew_prior_sd |
Standard deviation for logit(pnew) prior. |
pold_prior_mean |
Mean for logit(pold) prior. |
pold_prior_sd |
Standard deviation for logit(pold) prior. |
hier_readcutoff |
If |
init_pnew_prior_sd |
If |
pnew_prior_sd_min |
The minimum logit(pnew) prior standard deviation when |
pnew_prior_sd_max |
Similar to |
pold_est |
Background mutation rate estimates if you have them. Can either be a single number applied to all samples or a named vector of values, where the names should be sample names. |
pold_from_nolabel |
Fix background mutation rate estimate to mutation rates seen in -label samples.
By default, a single background rate is used for all samples, inferred from the average mutation rate
across all -label samples. The |
grouping_factors |
If |
character_limit |
Maximum number of characters for naming out fractions output. EZbakR
will try to name this as a "_" separated character vector of all of the features analyzed.
If this name is greater than |
overwrite |
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs. |
EstimateFractions uses mixture modeling to estimate the fraction of reads from
each mutational population in your data, and this is done for each feature in your
data (i.e., combination of columns that specify genomic features from which reads
were derived). The set of mutational populations in your
data can be specified by providing a fraction_design object, described in more
depth above (also see ?create_fraction_design). There are several flavors
of mixture modeling that can be performed by EstimateFractions. These are
as follows:
The default: global mutation rate parameters (global = same for all reads from all features) are estimated for each sample by fitting a single two-component mixture model to all of the reads in that sample. These are used to estimate the fraction of reads from each feature that are from each mutational population, also using a two-component mixture model.
With Poisson set to TRUE, this is a nucleotide-content adjusted Poisson
mixture model, which is a more efficient alternative to binomial mixture modeling.
With Poisson set to FALSE, this is a binomial mixture model.
Low mutation rate from -label: if pold_from_nolabel is TRUE, then the background,
no label mutation rates are estimated from -label samples. By default, a single set of
background mutation rates are estimated for all samples, but you can change this behavior
by setting grouping_factors to specify columns in your metadf by which samples
should be stratified.
This is a great strategy to use if you have low label incorporation rates or if you used a fairly short label time.
Hierarchical model: if strategy == "hierarchical", which is currently
only compatible for single-mutation type modeling (e.g., standard T-to-C mutation
modeling), then high T-to-C content mutation rates are estimated for each feature.
The global sample-wide estimates are used as an informative prior to increase the
accuracy of this process by avoiding extreme estimates.
An EZbakRFractions object, which is just an EZbakRData object
with a "fractions" list element. This will include tables of fraction estimates
(e.g., fraction of reads from the high T-to-C mutation rate population in a
standard single-label s4U experiment; termed fraction_highTC in the table) for
all features in all samples.
EstimateFractions(EZbakRData): Method for class EZbakRData
Estimates fractions using an entirely in memory object.
EstimateFractions(EZbakRArrowData): Mehthod for class EZbakRArrowData
This is an alternative to the fully in memory EZbakRData EstimateFractions method that can help
with analyses of larger than RAM datasets. The provided "cB" is expected to be
an on-disk Arrow Dataset. Furthermore, it is expected to be partitioned by the
sample name, which will allow this method to read only a single-sample worth
of data into memory at a time. This can significantly reduce RAM usage. Input
object should be created with EZbakRArrowData().
# Simulate data to analyze simdata <- SimulateOneRep(30) # Create EZbakR input metadf <- data.frame(sample = "sampleA", tl = 2) ezbdo <- EZbakRData(simdata$cB, metadf) # Estimate fractions ezbdo <- EstimateFractions(ezbdo)# Simulate data to analyze simdata <- SimulateOneRep(30) # Create EZbakR input metadf <- data.frame(sample = "sampleA", tl = 2) ezbdo <- EZbakRData(simdata$cB, metadf) # Estimate fractions ezbdo <- EstimateFractions(ezbdo)
Combines the output of EstimateFractions with transcript isoform
quantification performed by an outside tool (e.g., RSEM, kallisto, salmon, etc.)
to infer transcript isoform-specific fraction news (or more generally fraction
of reads coming from a particular mutation population).
EstimateIsoformFractions( obj, features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, fraction_name = NULL, quant_name = NULL, gene_to_transcript = NULL, overwrite = TRUE, TPM_min = 1, count_min = 10 )EstimateIsoformFractions( obj, features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, fraction_name = NULL, quant_name = NULL, gene_to_transcript = NULL, overwrite = TRUE, TPM_min = 1, count_min = 10 )
obj |
An |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
fraction_name |
Name of fraction estimate table to use. Should be stored in the
|
quant_name |
Name of transcript isoform quantification table to use. Should be stored
in the obj$readcounts list under this name. Use |
gene_to_transcript |
Table with columns |
overwrite |
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs. |
TPM_min |
Minimum TPM for a transcript to be kept in analysis. |
count_min |
Minimum expected_count for a transcript to be kept in analysis. |
EstimateIsoformFractions expects as input a "fractions" table with estimates
for transcript equivalence class (TEC) fraction news. A transcript equivalence class
is the set of transcript isoforms with which a sequencing read is compatible.
fastq2EZbakR is able to assign
reads to these equivalence classes so that EZbakR can estimate the fraction of
reads in each TEC that are from labeled RNA.
EstimateIsoformFractions estimates transcript isoform fraction news
by fitting a linear mixing model to the TEC fraction new estimates + transcript
isoform abundance estimates. In other words, each TEC fraction new (data) is modeled
as a weighted average of transcript isoform fraction news (parameters to estimate),
with the weights determined by the relative abundances of the transcript isoforms
in the TEC (data). The TEC fraction new is modeled as a Beta distribution with mean
equal to the weighted transcript isoform fraction new average and variance related
to the number of reads in the TEC.
Transcript isoforms with estimated TPMs less than with an estimated
TPM greater than TPM_min (1 by default) or an estimated number of read
counts less than count_min (10 by default) are removed from TECs and will
not have their fraction news estimated.
An EZbakRData object with an additional table under the "fractions"
list. Has the same form as the output of EstimateFractions(), and will have the
feature column "transcript_id".
# Load dependencies library(dplyr) # Simulates a single sample worth of data simdata_iso <- SimulateIsoforms(nfeatures = 100) # We have to manually create the metadf in this case metadf <- tibble(sample = 'sampleA', tl = 4, condition = 'A') ezbdo <- EZbakRData(simdata_iso$cB, metadf) ezbdo <- EstimateFractions(ezbdo) ### Hack in the true, simulated isoform levels reads <- simdata_iso$ground_truth %>% dplyr::select(transcript_id, true_count, true_TPM) %>% dplyr::mutate(sample = 'sampleA', effective_length = 10000) %>% dplyr::rename(expected_count = true_count, TPM = true_TPM) # Name of table needs to have "isoform_quant" in it ezbdo[['readcounts']][['simulated_isoform_quant']] <- reads ### Perform deconvolution ezbdo <- EstimateIsoformFractions(ezbdo)# Load dependencies library(dplyr) # Simulates a single sample worth of data simdata_iso <- SimulateIsoforms(nfeatures = 100) # We have to manually create the metadf in this case metadf <- tibble(sample = 'sampleA', tl = 4, condition = 'A') ezbdo <- EZbakRData(simdata_iso$cB, metadf) ezbdo <- EstimateFractions(ezbdo) ### Hack in the true, simulated isoform levels reads <- simdata_iso$ground_truth %>% dplyr::select(transcript_id, true_count, true_TPM) %>% dplyr::mutate(sample = 'sampleA', effective_length = 10000) %>% dplyr::rename(expected_count = true_count, TPM = true_TPM) # Name of table needs to have "isoform_quant" in it ezbdo[['readcounts']][['simulated_isoform_quant']] <- reads ### Perform deconvolution ezbdo <- EstimateIsoformFractions(ezbdo)
Simple kinetic parameter (kdeg and ksyn) estimation using fraction estimates
from EstimateFractions(). Several slightly different kinetic parameter
inference strategies are implemented.
EstimateKinetics( obj, strategy = c("standard", "tilac", "NSS", "shortfeed", "pulse-chase"), features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, grouping_factor = NULL, reference_factor = NULL, character_limit = 20, feature_lengths = NULL, exclude_pulse_estimates = TRUE, scale_factors = NULL, overwrite = TRUE )EstimateKinetics( obj, strategy = c("standard", "tilac", "NSS", "shortfeed", "pulse-chase"), features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, grouping_factor = NULL, reference_factor = NULL, character_limit = 20, feature_lengths = NULL, exclude_pulse_estimates = TRUE, scale_factors = NULL, overwrite = TRUE )
obj |
An |
strategy |
Kinetic parameter estimation strategy. Options include:
|
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
grouping_factor |
Which sample-detail columns in the metadf should be used
to group samples by for calculating the average RPM ( |
reference_factor |
Which sample-detail column in the metafd should be used to
determine which group of samples provide information about the starting levels of RNA
for each sample? This should have values that match those in |
character_limit |
Maximum number of characters for naming out fractions output. EZbakR
will try to name this as a "_" separated character vector of all of the features analyzed.
If this name is greater than |
feature_lengths |
Table of effective lengths for each feature combination in your data. For example, if your analysis includes features named GF and XF, this should be a data frame with columns GF, XF, and length. |
exclude_pulse_estimates |
If |
scale_factors |
Data frame with two columns, "sample" and "scale_factor".
sample should be the sample name, and scale_factor is the factor which read counts
in that sample should be divided to get the normalized read counts. This should match
the convention from DESeq2 and thus you can provide, for example, scale factors derived
from DESeq2 for this. By default, EZbakR uses DESeq2's median of ratios normalization
method, but specifying |
overwrite |
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs. |
EstimateKinetics() estimates the kinetics and RNA synthesis and degradation
from standard single-label NR-seq data. It also technically supports analyses
of the dual-label TILAC experiment, but that functionality is not as actively
tested or maintained as the more standard analyses.
EstimateKinetics() assumes a simple linear ODE model of RNA dynamics:
where:
R = Amount of RNA
= Synthesis rate constant
= Degradation rate constant
and for which the general solution is:
where is the initial RNA level.
The default kinetic parameter estimation strategy (strategy == "standard")
assumes that for labeled RNA (or more precisely RNA synthesized in the presence
of label) . Thus, it assumes a pulse-label rather than
pulse-chase experimental design (I've written several places, like here
and here
for example, about why pulse-label designs are superior to pulse-chases in
almost all settings).
For unlabeled RNA, it assumes that .
This is known as the steady-state assumption and is the key assumption of this method.
More broadly, assuming steady-state means that this method assumes that RNA levels
from a given feature are constant for the entire metabolic labeling duration.
As EZbakR is designed for analyses of bulk NR-seq data, "constant" means that the
average RNA levels across all profiled cells is constant (thus asynchronous populations
of dividing cells still count as steady-state, even if the RNA expression landscapes
in individual cells are quite dynamic). This assumption may be violated in cases
where labeling coincides with or closely follows a perturbation (e.g., drug treatment).
When the steady-state assumption hold, there is a simple relationship between the
fraction of reads from labeled RNA () and the turnover kinetics of the RNA:
The power in this approach is thus that turnover kinetics are estimated from
an "internally normalized" quantity: (termed the "fraction new",
"fraction labeled", "fraction high mutation content", or new-to-total ratio (NTR), depending
on where you look or who you ask). "Internally normalized" means that a normalization
scale factor does not need to be estimated in order to accurately infer turnover
kinetics. represents a ratio of read counts from the same feature
in the same library, thus any scale factor would appear in both the numerator
and denominator of this estimate and cancel out.
When this assumption is valid, the strategy = "NSS" approach may be preferable.
This approach relies on the same model of RNA dynamics, but now assumes that the
initial RNA levels ( for the unlabeled RNA) are not at the steady-state
levels dictated by the current synthesis and turnover kinetics. The idea for this
strategy was first presented in Narain et al. 2021.
To run this approach, you need to be able to estimate the initial RNA levels of
each label-fed sample, as the fraction of reads from labeled RNA will no longer.
only reflect the turnover kinetics (as the old RNA is assumed to potentially
not have reached the new steady-state levels yet). This means that for each
label-fed sample, you need to have a sample whose assayed RNA population represents
the starting RNA population for the label-fed sample. EZbakR will use these
reference samples to infer for unlabeled RNAs and estimate
turnover kinetics from the ratio of this quantity to the remaining unlabeled RNA levels
after labeling:
While I really like the idea of this strategy, it is not without some severe limitations. For one, the major benefit of NR-seq, internally normalized estimation of turnover kinetics, is out the window. Thus, read counts need to be normalized between the relevant reference and label-fed sample pairs. In addition, the variance patterns of this ratio are somewhat unfortunate. Its variance is incredibly high for more stable RNAs, severely limiting the effective dynamic range of this approach relative to steady-state analyses. I continue to work to refine EZbakR's implementation of this approach, but for now I urge caution in its use and interpretation. See my discussion of an alternative approach for navigating non-steady-state data here.
As an aside, you may wonder how this strategy deals with dynamic regulation of synthesis and degradation rate constants during labeling. To answer this, you have to realize that the duration of metabolic labeling represents an integration time over which the best we can do is assess average kinetics. Thus, if rate constants are changing during the labeling, this strategy can still be thought of as providing a an estimate of the time averaged synthesis and turnover kinetics during the label time.
Eventually, I will get around to implementing pulse-chase support in EstimateKinetics().
I haven't yet because I don't like pulse-chase experiments and think they are
way more popular than they should be for purely historical reasons. But lots
of people keep doing pulse-chase NR-seq so c'est la vie.
EZbakRKinetics object, which is just an EZbakRData object with
a "kinetics" slot. This includes tables of kinetic parameter estimates for
each feature in each sample for which kinetic parameters can be estimated.
# Simulate data to analyze simdata <- SimulateOneRep(30) # Create EZbakR input metadf <- data.frame(sample = "sampleA", tl = 2) ezbdo <- EZbakRData(simdata$cB, metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo)# Simulate data to analyze simdata <- SimulateOneRep(30) # Create EZbakR input metadf <- data.frame(sample = "sampleA", tl = 2) ezbdo <- EZbakRData(simdata$cB, metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo)
Two component mixture models are fit to all data to estimate global high and low mutation rates for all samples. Estimation of these mutation rates are regularized through the use of weakly informative priors whose parameters can be altered using the arguments defined below.
EstimateMutRates( obj, populations = "all", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL ) ## S3 method for class 'EZbakRData' EstimateMutRates( obj, populations = "all", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL ) ## S3 method for class 'EZbakRArrowData' EstimateMutRates( obj, populations = "all", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL )EstimateMutRates( obj, populations = "all", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL ) ## S3 method for class 'EZbakRData' EstimateMutRates( obj, populations = "all", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL ) ## S3 method for class 'EZbakRArrowData' EstimateMutRates( obj, populations = "all", pnew_prior_mean = -2.94, pnew_prior_sd = 0.3, pold_prior_mean = -6.5, pold_prior_sd = 0.5, pold_est = NULL, pold_from_nolabel = FALSE, grouping_factors = NULL )
obj |
An |
populations |
Character vector of the set of mutational populations that you want to infer the fractions of. For example, say your cB file contains columns tracking T-to-C and G-to-A |
pnew_prior_mean |
logit-Normal mean for logit(pnew) prior. |
pnew_prior_sd |
logit-Normal sd for logit(pnew) prior. |
pold_prior_mean |
logit-Normal mean for logit(pold) prior. |
pold_prior_sd |
logit-Normal sd for logit(pold) prior. |
pold_est |
Background mutation rate estimates if you have them. Can either be a single number applied to all samples or a named vector of values, where the names should be sample names. |
pold_from_nolabel |
Fix background mutation rate estimate to mutation rates seen in -label samples.
By default, a single background rate is used for all samples, inferred from the average mutation rate
across all -label samples. The |
grouping_factors |
If |
EZbakRData object with an added mutrates slot containing estimated
high and low mutation rates for each mutation type modeled.
EstimateMutRates(EZbakRData): Method for class EZbakRData
Estimates mutation rates using a fully in memory object.
EstimateMutRates(EZbakRArrowData): Method for class EZbakRArrowData
Estimate mutation rates using a partially on-disk object.
# Simulate data to analyze simdata <- SimulateOneRep(30) # Create EZbakR input metadf <- data.frame(sample = "sampleA", tl = 2) ezbdo <- EZbakRData(simdata$cB, metadf) # Estimate mutation rates mutrates <- EstimateMutRates(ezbdo)# Simulate data to analyze simdata <- SimulateOneRep(30) # Create EZbakR input metadf <- data.frame(sample = "sampleA", tl = 2) ezbdo <- EZbakRData(simdata$cB, metadf) # Estimate mutation rates mutrates <- EstimateMutRates(ezbdo)
An example cB table used to create an EZbakRData object. This cB table is a
subset of a cB file from the DCP2 KO dataset published in Luo et al., 2020.
The original file is large (69 MB), so the example cB file has been
downsampled and contains only a subset of reads from chromosome 21.
example_cBexample_cB
example_cBA tibble with 10,000 rows and 7 columns:
Sample name
Chromosome name
Gene name for reads aligning to any region of a gene
Gene name for reads aligning to exclusively exonic regions of a gene
Number of T-to-C mutations
Number of Ts
Number of reads with same value for the first 6 columns
Luo et al. (2020) Biochemistry. 59(42), 4121-4142
An example metadf table used to create an EZbakRData object. This metadf
describes the DCP2 KO dataset published in Luo et al., 2020.
example_metadfexample_metadf
example_metadfA tibble with 6 rows and 3 columns
Sample name
Metabolic label feed time
Whether sample was collected from WT or DCP2 KO cells
Luo et al. (2020) Biochemistry. 59(42), 4121-4142
EZbakRArrowData object helper function for usersEZbakRArrowData creates an object of class EZbakRArrowData and checks the validity
of the provided input.
EZbakRArrowData(cBds, metadf)EZbakRArrowData(cBds, metadf)
cBds |
ArrowDataset with the following fields:
|
metadf |
Data frame detailing various aspects of each of the samples included in the cBds. This includes:
|
An EZbakRArrowData object. This is simply a list of the provide cBds and
metadf with class EZbakRArrowData
# Load dependency library(arrow) simdata <- EZSimulate(30) # Create directory to write dataset to outdir <- tempdir() dataset_dir <- file.path(outdir, "arrow_dataset") # Create dataset write_dataset( simdata$cB, path = dataset_dir, format = "parquet", partitioning = "sample" ) # Create EZbakRArrowData object ds <- open_dataset(dataset_dir) ezbdo <- EZbakRArrowData(ds, simdata$metadf)# Load dependency library(arrow) simdata <- EZSimulate(30) # Create directory to write dataset to outdir <- tempdir() dataset_dir <- file.path(outdir, "arrow_dataset") # Create dataset write_dataset( simdata$cB, path = dataset_dir, format = "parquet", partitioning = "sample" ) # Create EZbakRArrowData object ds <- open_dataset(dataset_dir) ezbdo <- EZbakRArrowData(ds, simdata$metadf)
EZbakRData object helper function for usersEZbakRData creates an object of class EZbakRData and checks the validity
of the provided input.
EZbakRData(cB, metadf)EZbakRData(cB, metadf)
cB |
Data frame with the following columns:
|
metadf |
Data frame detailing various aspects of each of the samples included in the cB. This includes:
|
An EZbakRData object. This is simply a list of the provide cB and
metadf with class EZbakRData
# Simulate data simdata <- EZSimulate(30) # Create EZbakRData object ezbdo <- EZbakRData(simdata$cB, simdata$metadf)# Simulate data simdata <- EZSimulate(30) # Create EZbakRData object ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
EZbakRFractions helper function for usersEZbakRFractions creates an object of class EZbakRFractions and checks the validity
of the provided input.
EZbakRFractions(fractions, metadf, name = NULL, character_limit = 20)EZbakRFractions(fractions, metadf, name = NULL, character_limit = 20)
fractions |
Data frame with the following columns:
|
metadf |
Data frame detailing various aspects of each of the samples included in the fractions data frame. This includes:
|
name |
Optional; name to give to fractions table. |
character_limit |
Maximum number of characters for naming out fractions output. EZbakR
will try to name this as a "_" separated character vector of all of the features analyzed.
If this name is greater than |
An EZbakRFractions object. This is simply a list of the provide fractions and
metadf with class EZbakRFractions
# Simulate data simdata <- EZSimulate(30) # Get fractions table by estimating (for demonstration) ezbdo <- EZbakRData(simdata$cB, simdata$metadf) ezbdo <- EstimateFractions(ezbdo) fxns <- EZget(ezbdo, type = "fractions") # Create EZbakRFractions object ezbfo <- EZbakRFractions(fxns, simdata$metadf)# Simulate data simdata <- EZSimulate(30) # Get fractions table by estimating (for demonstration) ezbdo <- EZbakRData(simdata$cB, simdata$metadf) ezbdo <- EstimateFractions(ezbdo) fxns <- EZget(ezbdo, type = "fractions") # Create EZbakRFractions object ezbfo <- EZbakRFractions(fxns, simdata$metadf)
EZbakRKinetics helper function for usersEZbakRKinetics creates an object of class EZbakRKinetics and checks the validity
of the provided input.
EZbakRKinetics(kinetics, metadf, features, name = NULL, character_limit = 20)EZbakRKinetics(kinetics, metadf, features, name = NULL, character_limit = 20)
kinetics |
Data frame with the following columns:
|
metadf |
Data frame detailing various aspects of each of the samples included in the kinetics data frame. This includes:
|
features |
Features tracked in |
name |
Optional; name to give to fractions table. |
character_limit |
If name is chosen automatically, limit on the number of
characters in said name. If default name yields a string longer than this,
then kinetics table will be named |
An EZbakRKinetics object. This is simply a list of the provided kinetics and
metadf with class EZbakRKinetics
# Simulate data simdata <- EZSimulate(30) # Get kinetics table by estimating (for demonstration) ezbdo <- EZbakRData(simdata$cB, simdata$metadf) ezbdo <- EstimateFractions(ezbdo) ezbdo <- EstimateKinetics(ezbdo) kinetics <- EZget(ezbdo, type = "kinetics") # Create EZbakRKinetics object ezbko <- EZbakRKinetics(kinetics, simdata$metadf, features = "feature")# Simulate data simdata <- EZSimulate(30) # Get kinetics table by estimating (for demonstration) ezbdo <- EZbakRData(simdata$cB, simdata$metadf) ezbdo <- EstimateFractions(ezbdo) ezbdo <- EstimateKinetics(ezbdo) kinetics <- EZget(ezbdo, type = "kinetics") # Create EZbakRKinetics object ezbko <- EZbakRKinetics(kinetics, simdata$metadf, features = "feature")
EZDynamics() estimates parameters of a user-specified dynamical systems
model. The dynamical system model is specified through an adjacency matrix,
which is an NxN matrix described below (see graph documentation). Modeling
can either be done for species all assayed in each sample, or species that
are assayed across a set of independent samples (e.g., subcellular fractionation
involves assaying different species in different samples).
EZDynamics( obj, graph, sub_features, grouping_features, scale_factors = NULL, sample_feature = NULL, modeled_to_measured = NULL, parameter_names = paste0("logk", 1:max(graph)), unassigned_name = "__no_feature", type = "averages", prior_means = rep(-3, times = max(graph)), prior_sds = c(3, rep(1, times = max(graph) - 1)), avg_params_tokeep = NULL, avg_params_todrop = NULL, label_time_name = "tl", features = NULL, populations = NULL, fraction_design = NULL, parameter = NULL, repeatID = NULL, exactMatch = TRUE, feature_lengths = NULL, use_coverage = TRUE, normalization_repeatID = NULL, normalization_exactMatch = TRUE, species_to_sf = NULL, overwrite = TRUE )EZDynamics( obj, graph, sub_features, grouping_features, scale_factors = NULL, sample_feature = NULL, modeled_to_measured = NULL, parameter_names = paste0("logk", 1:max(graph)), unassigned_name = "__no_feature", type = "averages", prior_means = rep(-3, times = max(graph)), prior_sds = c(3, rep(1, times = max(graph) - 1)), avg_params_tokeep = NULL, avg_params_todrop = NULL, label_time_name = "tl", features = NULL, populations = NULL, fraction_design = NULL, parameter = NULL, repeatID = NULL, exactMatch = TRUE, feature_lengths = NULL, use_coverage = TRUE, normalization_repeatID = NULL, normalization_exactMatch = TRUE, species_to_sf = NULL, overwrite = TRUE )
obj |
Currently must be an EZbakRData object on which |
graph |
An NxN adjacency matrix, where N represents the number of species being modeled. One of these species must be called "0" and represent the "no RNA" species. This is the species from which some species are synthesized (e.g., 0 -> P, means premature RNA is synthesized from no RNA), and the species to which some species are degraded (e.g., M -> 0 means mature RNA is converted to "no RNA" via degradation). The rows and columns of this matrix must be the names of all modled species, and rownames(graph) == colnames(graph). Entry i,j of the matrix is either 0 if species i cannot be converted into species j under your model, and an integer from 1:npars (where npars = total number of parameters to be estimated) if it can. For example, the model 0 -> P -> M -> 0 would have the |
sub_features |
Which feature columns distinguish between the different
measured species? Note, the measured species need not have the same name,
and may not be directly equivalent to, the modeled species. The relationship
between the modeled species in |
grouping_features |
Which features are the overarching feature assignments
by which |
scale_factors |
Data frame mapping samples to factors by which to multiply read counts so as ensure proper normalization between different RNA populations. Only relevant if you are modeling relationships between distinct RNA populations, for example RNA from nuclear and cytoplasmic fractions. Will eventually be inferred automatically. |
sample_feature |
If different samples involve assaying different species,
this must be the name of the metadf column that distinguishes the different
classes of samples. For example, if analyzing a subcellular fractionation
dataset, you likely included a column in your metadf called "compartment".
This would then be your |
modeled_to_measured |
If For example, if your model is 0 -> P -> M -> 0, where P stands for premature RNA
and M stands for mature RNA, and you have a column called
GF that corresponds to assignment anywhere in the gene, and XF that corresponds
to assignment of exclusively exon mapping reads, then your As another example, if your model is 0 -> N -> C -> 0, where N stands for
"nuclear RNA" and C stands for "cytoplasmic RNA", and your |
parameter_names |
Vector of names you would like to give to the estimated
parameters. ith element should correspond to name of parameter given the ID
i in |
unassigned_name |
What value will a |
type |
What type of table would you like to use? Currently only supports "averages", but will support "fractions" in the near future. |
prior_means |
Mean of log-Normal prior for kinetic parameters. Should
be vector where ith value is mean for ith parameter, i = index in |
prior_sds |
Std. dev. of log-Normal prior for kinetic parameter.
Should be vector where ith value is mean for ith parameter, i = index in |
avg_params_tokeep |
Names of parameters in averages table that you would like to keep. Other parameters will be discarded. Don't include the prefixes "mean_", "sd_", or "coverage_"; these will be imputed automatically. In other words, this should be the base parameter names. |
avg_params_todrop |
Names of parameters in averages table that you would like to drop. Other parameters will be kept. Don't include the prefixes "mean_", "sd_", or "coverage_"; these will be imputed automatically. In other words, this should be the base parameter names. |
label_time_name |
Name of relevant label time column that will be found in the parameter names. Defaults to the standard "tl". |
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of "all"
will use all feature columns in the |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
parameter |
Parameter to average across replicates of a given condition.
Has to be |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
feature_lengths |
Table of effective lengths for each feature combination in your data. For example, if your analysis includes features named GF and XF, this should be a data frame with columns GF, XF, and length. |
use_coverage |
If TRUE, normalized read counts will be used to inform kinetic parameter estimates. If FALSE, only fraction news will be used, which will leave some parameters (e.g., synthesis rate) unidentifiable, though has the advantage of avoiding the potential biases induced by normalization problems. |
normalization_repeatID |
For extracting the |
normalization_exactMatch |
For extracting the |
species_to_sf |
List mapping individual RNA species in |
overwrite |
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs. |
When running AverageAndRegularize() to produce input for EZDynamics(), you
must set parameter to logit_fraction_high<muttype> (<muttype> = type of mutation
modeled by EstimateFractions(), e.g., TC). If you have multiple distinct label times,
you must also include the label time (tl of your metadf)
in your regression formula. EZDynamics() models the logit(fraction high <muttype>),
and this will depend on the label time (longer label time = higher fraction), which
is why these two conditions must be met. If you only have a single label time though,
EZDynamics will be able to impute this one value for all samples from your metadf.
You can also include additional interaction terms in your AverageAndRegularize()
model for different experimental conditions in which experiments were conducted,
so that inferred kinetic parameters can be compared across these conditions. Currently,
more complex modeling beyond simple stratification of samples by one or more condition
is not possible with EZDynamics().
For normalization purposes, especially if analyzing pre-mRNA processing dynamics,
you will need to provide AverageAndRegularize() with a table of feature lengths
via the feature_lengths parameter. This will be used in all cases to length
normalize read counts. Even in the case when you are just modeling mature mRNA
dynamics, this is technically necessary for accurate estimation of scale factors.
The first step of EZDynamics() is attempted inference of normalization scale
factors for read counts. If you have scale factors you calculated yourself,
e.g. via specialized spike-in protocols, you can provide these via the scale_factors
parameter. If not, EZDynamics() will try to infer these from the fraction labeled's
in each sample_feature (e.g., in different subcellular compartments). This relies on
having some sample_feature's that are a combination of other sample_feature's. For
example, if analyzing subcellular fractionation data, you may have 1) total RNA; 2)
cytoplasmic RNA; and 3) nuclear RNA. Total RNA = cytoplasmic + nuclear RNA, and thus
the fraction of reads from labeled RNA is a function of the total cytoplasmic and
nuclear fraction labeled's, as well as the relative molecular abundances of cytoplasmic
and nuclear RNA. The latter is precisely the scale factors we need to estimate.
If you do not have sufficient combinations of data to perform this scale factor estimation,
EZDynamics() will only use the fraction labeled's for modeling kinetic parameters.
It can then perform post-hoc normalization to estimate a single synthesis rate constant,
using the downstream rate constants to infer the unknown normalization scale factor necessary
to combine kinetic parameter estimates and read counts to infer this rate constant.
For estimating kinetic parameters, EZDynamics() infers the solution of the linear
system of ODEs specified in your graph matrix input. This is done by representing
the system of equations as a matrix, and deriving the general solution of the levels
of each modeled species of RNA from the eigenvalues and eigenvectors of this matrix.
While this makes EZDynamics() orders of magnitude more efficient than if it had to
numerically infer the solution for each round of optimization, needing to compute
eigenvalues and eigenvectors in each optimization iteration is still non-trivial,
meaning that EZDynamics() may take anywhere from 10s of minutes to a couple hours
to run, depending on how complex your model is and how many distinct set of samples
and experimental conditions you have.
EZbakRData object with an additional "dynamics" table.
##### MODELING CYTOPLASMIC TO NUCLEAR FLOW ### Simulate data and get replicate average fractions estimates simdata <- EZSimulate( nfeatures = 30, ntreatments = 1, mode = "dynamics", label_time = c(1, 3), dynamics_preset = "nuc2cyto" ) ezbdo <- EZbakRData(simdata$cB, simdata$metadf) ezbdo <- EstimateFractions(ezbdo) ezbdo <- AverageAndRegularize(ezbdo, formula_mean = ~tl:compartment - 1, type = "fractions", parameter = "logit_fraction_highTC") ### ODE model: the graph and measured species graph <- matrix(c(0, 1, 0, 0, 0, 2, 3, 0, 0), nrow = 3, ncol = 3, byrow = TRUE) colnames(graph) <- c("0", "N", "C") rownames(graph) <- colnames(graph) modeled_to_measured <- list( nuclear = list(GF ~ N), cytoplasm = list(GF ~ C), total = list(GF ~ C + N) # total RNA is a combination of C and N ) ### Fit model ezbdo <- EZDynamics(ezbdo, graph = graph, sub_features = "GF", grouping_features = "GF", sample_feature = "compartment", modeled_to_measured = ode_models$nuc2cyto$formulas)##### MODELING CYTOPLASMIC TO NUCLEAR FLOW ### Simulate data and get replicate average fractions estimates simdata <- EZSimulate( nfeatures = 30, ntreatments = 1, mode = "dynamics", label_time = c(1, 3), dynamics_preset = "nuc2cyto" ) ezbdo <- EZbakRData(simdata$cB, simdata$metadf) ezbdo <- EstimateFractions(ezbdo) ezbdo <- AverageAndRegularize(ezbdo, formula_mean = ~tl:compartment - 1, type = "fractions", parameter = "logit_fraction_highTC") ### ODE model: the graph and measured species graph <- matrix(c(0, 1, 0, 0, 0, 2, 3, 0, 0), nrow = 3, ncol = 3, byrow = TRUE) colnames(graph) <- c("0", "N", "C") rownames(graph) <- colnames(graph) modeled_to_measured <- list( nuclear = list(GF ~ N), cytoplasm = list(GF ~ C), total = list(GF ~ C + N) # total RNA is a combination of C and N ) ### Fit model ezbdo <- EZDynamics(ezbdo, graph = graph, sub_features = "GF", grouping_features = "GF", sample_feature = "compartment", modeled_to_measured = ode_models$nuc2cyto$formulas)
EZget() returns a table of interest from your EZbakRData object. It is meant
to make it easier to find and access certain analyses, as a single EZbakRData
object may include analyses of different features, kinetic parameters, dynamical
systems models, comparisons, etc.
EZget( obj, type = c("fractions", "kinetics", "readcounts", "averages", "comparisons", "dynamics"), features = NULL, populations = NULL, fraction_design = NULL, isoforms = NULL, kstrat = NULL, parameter = NULL, counttype = NULL, design_factor = NULL, dynamics_design_factors = NULL, scale_factors = NULL, cstrat = NULL, feature_lengths = NULL, experimental = NULL, reference = NULL, param_name = NULL, param_function = NULL, formula_mean = NULL, sd_grouping_factors = NULL, fit_params = NULL, repeatID = NULL, sub_features = NULL, grouping_features = NULL, sample_feature = NULL, modeled_to_measured = NULL, graph = NULL, normalize_by_median = NULL, deconvolved = NULL, returnNameOnly = FALSE, exactMatch = FALSE, alwaysCheck = FALSE )EZget( obj, type = c("fractions", "kinetics", "readcounts", "averages", "comparisons", "dynamics"), features = NULL, populations = NULL, fraction_design = NULL, isoforms = NULL, kstrat = NULL, parameter = NULL, counttype = NULL, design_factor = NULL, dynamics_design_factors = NULL, scale_factors = NULL, cstrat = NULL, feature_lengths = NULL, experimental = NULL, reference = NULL, param_name = NULL, param_function = NULL, formula_mean = NULL, sd_grouping_factors = NULL, fit_params = NULL, repeatID = NULL, sub_features = NULL, grouping_features = NULL, sample_feature = NULL, modeled_to_measured = NULL, graph = NULL, normalize_by_median = NULL, deconvolved = NULL, returnNameOnly = FALSE, exactMatch = FALSE, alwaysCheck = FALSE )
obj |
EZbakRData object |
type |
The class of EZbakR outputs would you like to search through. Equivalent to the name of the list in the EZbakRData object that contains the tables of interest. |
features |
Features that must be present in the table of interest.
If |
populations |
Only relevant if |
fraction_design |
Only relevant if |
isoforms |
If the relevant table is the result of isoform deconvolution |
kstrat |
Only relevant if |
parameter |
Only relevant if |
counttype |
String denoting what type of read count information you are looking for. Current options are "TMM_normalized", "transcript", and "matrix". TO-DO: Not sure this is being used in any way currently... |
design_factor |
design_factor specified in relevant run of |
dynamics_design_factors |
design_factors included in final |
scale_factors |
Sample group scale factors used in |
cstrat |
Strategy used for comparative analyses. Can be:
|
feature_lengths |
Table of feature lengths used for length normalization. |
experimental |
Experimental condition specified in relevant run of |
reference |
Reference condition specified in relevant run of |
param_name |
Parameter name specified in relevant run of |
param_function |
Function of parameters specified in relevant run of |
formula_mean |
An R formula object specifying how the |
sd_grouping_factors |
What metadf columns should data be grouped by when estimating standard deviations across replicates? Therefore, only relevant if type == "averages". |
fit_params |
Character vector of names of parameters in linear model fit. Therefore, only relevant if type == "averages". |
repeatID |
Numerical ID for duplicate objects with same metadata. |
sub_features |
Only relevant if type == "dynamics". Feature columns
that distinguished between the different measured species when running
|
grouping_features |
Only relevant if type == "dynamics. Features
that were the overarching feature assignments by which |
sample_feature |
Only relevant if type == "dynamics". Name of the metadf
column that distinguished the different classes of samples when running
|
modeled_to_measured |
Only relevant if type == "dynamics". Specifies
the relationship between |
graph |
Only relevant if type == "dynamics". NxN adjacency matrix,
where N represents the number of species modeled when running |
normalize_by_median |
Whether median difference was subtracted from estimated kinetic parameter difference. Thus, only relevant if type == "comparisons". |
deconvolved |
Only relevant if type == "fractions". Boolean that is TRUE if
fractions table is result of performing multi-feature deconvolution with
|
returnNameOnly |
If TRUE, then only the names of tables that passed your
search criteria will be returned. Else, the single table passing your search
criteria will be returned. If there is more than one table that passes your
search criteria and |
exactMatch |
If TRUE, then for |
alwaysCheck |
If TRUE, then even if there is only a single table for the |
The input to EZget() is 1) the type of table you want to get ("fractions",
"kinetics", "averages", "comparisons", or "dynamics") and 2) the metadata necessary
to uniquely specify the table of interest. Above, every available piece of metadata
that can be specified for this purpose is documented. You only need to specify the
minimum information necessary. For example, if you would like to get a "fractions"
table from an analysis of exon bins (feature == "exon_bins", and potentially
other overarching features like "XF", "GF", or "rname"), and none of your other
"fractions" tables includes exon_bins as a feature, then you can get this table
with EZget(ezbdo, type = "fractions", features = "exon_bins"), where ezbdo
is your EZbakRData object.
As another example, imagine you want to get a "kinetics" table from an analysis
of gene-wise kinetic parameters (e.g., features == "XF"). You may have multiple
"kinetics" tables, all with "XF" as at least one of their features. If all of the
other tables have additional features though, then you can tell EZget() that
"XF" is the only feature present in your table of interest by setting exactMatch
to TRUE, which tells EZget() that the metadata you specify should exactly match
the relevant metadata for the table of interest. So the call in this case would
look like EZget(ezbdo, type = "fractions", features = "XF", exactMatch = TRUE).
EZget() is used internally in almost every single EZbakR function to specify
the input table for each analysis. Thus, the usage and metadata described here
also applies to all functions that require you to specify which table you want
to use (e.g., EstimateKinetics(), AverageAndRegularize(), CompareParameters(),
etc.).
Table of interest from the relevant EZbakRdata list (set by the
type parameter).
# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo) # Average log(kdeg) estimates across replicate ezbdo <- AverageAndRegularize(ezbdo) #' # Average log(ksyn) estimates across replicate ezbdo <- AverageAndRegularize(ezbdo, parameter = "log_ksyn") # Compare log(kdeg) across conditions ezbdo <- CompareParameters( ezbdo, design_factor = "treatment", reference = "treatment1", experimental = "treatment2" ) # Get the one and only fractions object fxns <- EZget(ezbdo, type = "fractions") # Get the log(ksyn) averages table ksyn_avg <- EZget(ezbdo, type = "averages", parameter = "log_ksyn")# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo) # Average log(kdeg) estimates across replicate ezbdo <- AverageAndRegularize(ezbdo) #' # Average log(ksyn) estimates across replicate ezbdo <- AverageAndRegularize(ezbdo, parameter = "log_ksyn") # Compare log(kdeg) across conditions ezbdo <- CompareParameters( ezbdo, design_factor = "treatment", reference = "treatment1", experimental = "treatment2" ) # Get the one and only fractions object fxns <- EZget(ezbdo, type = "fractions") # Get the log(ksyn) averages table ksyn_avg <- EZget(ezbdo, type = "averages", parameter = "log_ksyn")
Make a plot of effect size (y-axis) vs. log10(read coverage) (x-axis), coloring points by position relative to user-defined decision cutoffs.
EZMAPlot( obj, parameter = "log_kdeg", design_factor = NULL, reference = NULL, experimental = NULL, param_name = NULL, param_function = NULL, features = NULL, condition = NULL, repeatID = NULL, exactMatch = TRUE, plotlog2 = TRUE, FDR_cutoff = 0.05, difference_cutoff = log(2), size = NULL, features_to_highlight = NULL, highlight_shape = 21, highlight_size_diff = 1, highlight_stroke = 0.7, highlight_fill = NA, highlight_color = "black" )EZMAPlot( obj, parameter = "log_kdeg", design_factor = NULL, reference = NULL, experimental = NULL, param_name = NULL, param_function = NULL, features = NULL, condition = NULL, repeatID = NULL, exactMatch = TRUE, plotlog2 = TRUE, FDR_cutoff = 0.05, difference_cutoff = log(2), size = NULL, features_to_highlight = NULL, highlight_shape = 21, highlight_size_diff = 1, highlight_stroke = 0.7, highlight_fill = NA, highlight_color = "black" )
obj |
An object of class |
parameter |
Name of parameter whose comparison you want to plot. |
design_factor |
Name of factor from |
reference |
Name of reference |
experimental |
Name of |
param_name |
If you want to assess the significance of a single parameter, rather than the comparison of two parameters, specify that one parameter's name here. |
param_function |
NOT YET IMPLEMENTED. Will allow you to specify more complicated functions of parameters when hypotheses you need to test are combinations of parameters rather than individual parameters or simple differences in two parameters. |
features |
Character vector of feature names for which comparisons were made. |
condition |
Defunct parameter that has been replaced with |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
plotlog2 |
If TRUE, assume that log(parameter) difference is passed in and that you want to plot log2(parameter) difference. |
FDR_cutoff |
False discovery cutoff by which to color points. |
difference_cutoff |
Minimum absolute difference cutoff by which to color points. |
size |
Size of points, passed to |
features_to_highlight |
Features you want to highlight in the plot (black circle will be drawn around them). This can either be a data frame with one column per feature type in the comparison table you are visualizing, or a vector of feature names if the relevant comparison table will only have one feature type noted. |
highlight_shape |
Shape of points overlayed on highlighted features. Defaults to an open circle |
highlight_size_diff |
Sets how much larger should the points overlayed on the highlighted features be than the original plot points. |
highlight_stroke |
Stroke width of the points overlayed on the highlighted features. |
highlight_fill |
Fill color of the points overlayed on the highlighted
features. Default is for them to be fill-less ( |
highlight_color |
Color of the points overlayed on the highlighted points. |
EZMAPlot() accepts as input the output of CompareParameters(), i.e.,
an EZbakRData object with at least one "comparisons" table. It will plot
the "avg_coverage" column in this table vs. the "difference" column.
In the simplest case, "difference" represents a log-fold change in a kinetic
parameter (e.g., kdeg) estimate. More complicated linear model fits and
comparisons can yield different parameter estimates.
NOTE: some outputs of CompareParameters() are not meant for visualization
via an MA plot. For example, when fitting certain interaction models,
some of the parameter estimates may represent average log(kinetic paramter)
in one condition. See discussion of one example of this here.
EZbakR estimates kinetic parameters in EstimateKinetics() and EZDynamics()
on a log-scale. By default, since log2-fold changes are a bit easier to interpret
and more common for these kind of visualizations, EZMAPlot() multiplies
the y-axis value by log2(exp(1)), which is the factor required to convert from
a log to a log2 scale. You can turn this off by setting plotlog2 to FALSE.
A ggplot2 object. Y-axis = log2(estimate of interest (e.g., fold-change
in degradation rate constant); X-axis = log10(average normalized read coverage);
points colored by location relative to FDR and effect size cutoffs.
# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo) # Average estimates across replicate ezbdo <- AverageAndRegularize(ezbdo) # Compare parameters across conditions ezbdo <- CompareParameters( ezbdo, design_factor = "treatment", reference = "treatment1", experimental = "treatment2" ) # Make MA plot (ggplot object that you can save and add/modify layers) EZMAPlot(ezbdo)# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo) # Average estimates across replicate ezbdo <- AverageAndRegularize(ezbdo) # Compare parameters across conditions ezbdo <- CompareParameters( ezbdo, design_factor = "treatment", reference = "treatment1", experimental = "treatment2" ) # Make MA plot (ggplot object that you can save and add/modify layers) EZMAPlot(ezbdo)
Make a plot of effect size (y-axis) vs. log10(read coverage) (x-axis), coloring points by position relative to user-defined decision cutoffs.
EZpcaPlot( obj, data_type = c("fraction_labeled", "reads"), features = NULL, exactMatch = TRUE, variance_decile = 7, center = TRUE, scale = TRUE, point_size = 3, metadf_cols_to_use = "all" )EZpcaPlot( obj, data_type = c("fraction_labeled", "reads"), features = NULL, exactMatch = TRUE, variance_decile = 7, center = TRUE, scale = TRUE, point_size = 3, metadf_cols_to_use = "all" )
obj |
An object of class |
data_type |
Specifies what data to use for the PCA. Options are "fraction_labeled" (default; means using fraction high T-to-C or other mutation type estimate from EZbakR) or "reads" (means using log10(read counts + 1)). |
features |
Character vector of feature names for which comparisons were made. |
exactMatch |
If TRUE, then |
variance_decile |
Integer between (inclusive) 1 and 9. Features with sample-to-sample
variance greater than the nth decile (n = |
center |
From |
scale |
From |
point_size |
Size of points in PCA plot |
metadf_cols_to_use |
Columns in the EZbakR metadf that will be used to color points in the PCA plot. Points will be colored by the interaction between all of these columns (i.e., samples with unique combinations of values of these columns will get unique colors). Default is to use all columns (except "sample"), specified as "all". |
EZMAPlot() accepts as input the output of CompareParameters(), i.e.,
an EZbakRData object with at least one "comparisons" table. It will plot
the "avg_coverage" column in this table vs. the "difference" column.
In the simplest case, "difference" represents a log-fold change in a kinetic
parameter (e.g., kdeg) estimate. More complicated linear model fits and
comparisons can yield different parameter estimates.
NOTE: some outputs of CompareParameters() are not meant for visualization
via an MA plot. For example, when fitting certain interaction models,
some of the parameter estimates may represent average log(kinetic paramter)
in one condition. See discussion of one example of this here.
EZbakR estimates kinetic parameters in EstimateKinetics() and EZDynamics()
on a log-scale. By default, since log2-fold changes are a bit easier to interpret
and more common for these kind of visualizations, EZMAPlot() multiplies
the y-axis value by log2(exp(1)), which is the factor required to convert from
a log to a log2 scale. You can turn this off by setting plotlog2 to FALSE.
A ggplot2 object.
# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Make MA plot (ggplot object that you can save and add/modify layers) EZpcaPlot(ezbdo)# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Make MA plot (ggplot object that you can save and add/modify layers) EZpcaPlot(ezbdo)
EZQC() assesses multiple aspects of your NR-seq data and generates a number
of plots visualizing dataset-wide trends.
EZQC(obj, ...)EZQC(obj, ...)
obj |
EZbakRData or EZbakRFractions object. |
... |
Parameters passed to the class-specific method.
If you have provided an EZbakRFractions object, then these can be (all play the
same role as in
If you have provided an EZbakRData object, then these can be (all same the same
purpose as in
|
EZQC() checks the following aspects of your NR-seq data. If you have passed
an EZbakRData object, then EZQC() checks:
Raw mutation rates: In all sequencing reads, how many T's in the reference were a C in the read? The hope is that raw mutation rates are higher than -label controls in all +label samples. Higher raw mutation rates, especially when using standard label times (e.g., 2 hours or more in mammalian systems), are typically a sign of good label incorporation and low labeled RNA/read dropout. If you don't have -label samples, know that background mutation rates are typically less than 0.2%, so +label raw mutation rates several times higher than this would be preferable.
Mutation rates in labeled and unlabeled reads: The raw mutation rate counts all mutations in all reads. In a standard NR-seq experiment performed with a single metabolic label, there are typically two populations of reads:
Those from labeled RNA, having higher mutation rates due to chemical conversion/recoding
of the metabolic label and 2) those from unlabeled RNA, having lower, background
levels of mutations. EZbakR fits a two component mixture model to estimate the mutation
rates in these two populations separately. A successful NR-seq experiment should
have a labeled read mutation rate of > 1% and a low background mutation rate of
< 0.3%.
Read count replicate correlation: Simply the log10 read count correlation
for replicates, as inferred from your metadf.
If you have passed an EZbakRFractions object, i.e., the output of EstimateFractions(),
then in addition to the checks in the EZbakRData input case, EZQC() also checks:
Fraction labeled distribution: This is the distribution of feature-wise
fraction labeled's (or fraction high mutation content's) estimated by EstimateFractions().
The "ideal" is a distribution with mean around 0.5, as this maximizes the amount of RNA
with synthesis and degradation kinetics within the dynamic range of the experiment. In practice,
you will (and should) be at least a bit lower than this as longer label times risk
physiological impacts of metabolic labeling.
Fraction labeled replicate correlation: This is the logit(fraction labeled)
correlation between replicates, as inferred from your metadf.
A list of ggplot2 objects visualizing the various aspects of your data
assessed by EZQC().
# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Run QC QC <- EZQC(ezbdo)# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Run QC QC <- EZQC(ezbdo)
Run quality control checks
## S3 method for class 'EZbakRArrowData' EZQC( obj, mutrate_populations = "all", features = "all", filter_cols = "all", filter_condition = `&`, remove_features = c("NA", "__no_feature"), ... )## S3 method for class 'EZbakRArrowData' EZQC( obj, mutrate_populations = "all", features = "all", filter_cols = "all", filter_condition = `&`, remove_features = c("NA", "__no_feature"), ... )
obj |
An EZbakRData object |
mutrate_populations |
Same as in |
features |
Same as in |
filter_cols |
Same as in |
filter_condition |
Same as in |
remove_features |
Same as in |
... |
Additional arguments. Currently goes unused. |
A list of ggplot2 objects visualizing the various aspects of your data
assessed by EZQC().
Run quality control checks
## S3 method for class 'EZbakRData' EZQC( obj, mutrate_populations = "all", features = "all", filter_cols = "all", filter_condition = `&`, remove_features = c("NA", "__no_feature"), ... )## S3 method for class 'EZbakRData' EZQC( obj, mutrate_populations = "all", features = "all", filter_cols = "all", filter_condition = `&`, remove_features = c("NA", "__no_feature"), ... )
obj |
An EZbakRData object |
mutrate_populations |
Same as in |
features |
Same as in |
filter_cols |
Same as in |
filter_condition |
Same as in |
remove_features |
Same as in |
... |
Additional arguments. Currently goes unused. |
A list of ggplot2 objects visualizing the various aspects of your data
assessed by EZQC().
Run quality control checks
## S3 method for class 'EZbakRFractions' EZQC(obj, features = NULL, populations = NULL, fraction_design = NULL, ...)## S3 method for class 'EZbakRFractions' EZQC(obj, features = NULL, populations = NULL, fraction_design = NULL, ...)
obj |
EZbakRFractions object, which is an EZbakRData object on which
|
features |
Set of features analyzed in the fractions table you are
interested QCing. This gets passed to |
populations |
Set of mutation types analyzed in the fractions table
you are interested in QCing. This gets passed to |
fraction_design |
The fraction "design matrix" specified to get the
fractions table you are interested in QCing. This gets passed to |
... |
Additional arguments. Currently goes unused. |
A list of ggplot2 objects visualizing the various aspects of your data
assessed by EZQC().
EZSimulate() is a user friendly wrapper to SimulateMultiCondition(). It
sets convenient defaults so as to quickly generate easy to interpret output.
EZSimulate() has all of the same parameters as SimulateMultiCondition(),
but it also has a number of additional parameters that guide its default behavior
and allow you to simulate multi-condition data without specifying the multiple,
sometimes complex, arguments that you would need to specify in SimulateMultiCondition()
to get the same behavior. In particular, users only have to set a single parameter,
nfeatures (number of features to simulate data for), by default. The EZSimulate()-unique
parameters ntreatments and nreps have default values that guide the simulation in the
case where only nfeatures is specified. In particular, nreps of ntreatments different
conditions will be simulated, with the assumed model log(kdeg) ~ treatment and log(ksyn) ~ 1.
In other words, Different kdeg values will be simulated for each treatment level, and ksyn
values will not differ across conditions.
EZSimulate( nfeatures, mode = c("standard", "dynamics"), ntreatments = ifelse(mode == "standard", 2, 1), nreps = 3, nctlreps = 1, metadf = NULL, mean_formula = NULL, param_details = NULL, seqdepth = nfeatures * 2500, label_time = 2, pnew = 0.05, pold = 0.001, readlength = 200, Ucont_alpha = 25, Ucont_beta = 75, feature_prefix = "Gene", dispslope = 5, dispint = 0.01, logkdegsdtrend_slope = -0.3, logkdegsdtrend_intercept = -2.25, logksynsdtrend_slope = -0.3, logksynsdtrend_intercept = -2.25, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7, logkdeg_diff_avg = 0, logksyn_diff_avg = 0, logkdeg_diff_sd = 0.5, logksyn_diff_sd = 0.5, pdiff_kd = 0.1, pdiff_ks = 0, pdiff_both = 0, pdo = 0, dynamics_preset = c("preRNA", "nuc2cyto", "preRNAwithPdeg", "nuc2cytowithNdeg", "subtlseq", "nuc2cytowithpreRNA"), unassigned_name = "__no_feature", dispersion = 1000, lfn_sd = 0.2, treatment_effects = NULL, effect_avg_default = 0, effect_sd_default = 0.5, fraction_affected_default = 0.5, log_means = NULL, log_sds = NULL )EZSimulate( nfeatures, mode = c("standard", "dynamics"), ntreatments = ifelse(mode == "standard", 2, 1), nreps = 3, nctlreps = 1, metadf = NULL, mean_formula = NULL, param_details = NULL, seqdepth = nfeatures * 2500, label_time = 2, pnew = 0.05, pold = 0.001, readlength = 200, Ucont_alpha = 25, Ucont_beta = 75, feature_prefix = "Gene", dispslope = 5, dispint = 0.01, logkdegsdtrend_slope = -0.3, logkdegsdtrend_intercept = -2.25, logksynsdtrend_slope = -0.3, logksynsdtrend_intercept = -2.25, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7, logkdeg_diff_avg = 0, logksyn_diff_avg = 0, logkdeg_diff_sd = 0.5, logksyn_diff_sd = 0.5, pdiff_kd = 0.1, pdiff_ks = 0, pdiff_both = 0, pdo = 0, dynamics_preset = c("preRNA", "nuc2cyto", "preRNAwithPdeg", "nuc2cytowithNdeg", "subtlseq", "nuc2cytowithpreRNA"), unassigned_name = "__no_feature", dispersion = 1000, lfn_sd = 0.2, treatment_effects = NULL, effect_avg_default = 0, effect_sd_default = 0.5, fraction_affected_default = 0.5, log_means = NULL, log_sds = NULL )
nfeatures |
Number of "features" (e.g., genes) for which to simulated data. |
mode |
Currently, EZSimulate can simulate in two modes: "standard" and "dynamics".
The former is the default and involves simulating multiple conditions of standard NR-seq data.
"dynamics" calls |
ntreatments |
Number of distinct treatments to simulate. This parameter is
only relevant if |
nreps |
Number of replicates of each treatment to simulate. This parameter is
only relevant if |
nctlreps |
Number of -s4U replicates of each treatment to simulate. This parameter
is only relevant if |
metadf |
A data frame with the following columns:
These parameters (described more below) can also be included in metadf to specify sample-specific simulation parameter:
|
mean_formula |
A formula object that specifies the linear model used to
relate the factors in the |
param_details |
A data frame with one row for each column of the design matrix
obtained from
If param_details is not specified by the user, the first column of the design matrix
is assumed to represent the reference parameter, all parameters are assumed to be
non-global, logkdeg_mean and logksyn_mean are set to the equivalently named parameter values
described below for the reference and |
seqdepth |
Total number of reads in each sample. |
label_time |
Length of s^4^U feed to simulate. |
pnew |
Probability that a T is mutated to a C if a read is new. |
pold |
Probability that a T is mutated to a C if a read is old. |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
Ucont_alpha |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape1 = |
Ucont_beta |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape2 = |
feature_prefix |
Name given to the i-th feature is |
dispslope |
Negative binomial dispersion parameter "slope" with respect to read counts. See DESeq2 paper for dispersion model used. |
dispint |
Negative binomial dispersion parameter "intercept" with respect to read counts. See DESeq2 paper for dispersion model used. |
logkdegsdtrend_slope |
Slope for log10(read count) vs. log(kdeg) replicate variability trend |
logkdegsdtrend_intercept |
Intercept for log10(read count) vs. log(kdeg) replicate variability trend |
logksynsdtrend_slope |
Slope for log10(read count) vs. log(ksyn) replicate variability trend |
logksynsdtrend_intercept |
Intercept for log10(read count) vs. log(ksyn) replicate variability trend |
logkdeg_mean |
Mean of normal distribution from which reference log(kdeg)
linear model parameter is drawn from for each feature if |
logkdeg_sd |
Standard deviation of normal distribution from which reference log(kdeg)
linear model parameter is drawn from for each feature if |
logksyn_mean |
Mean of normal distribution from which reference log(ksyn)
linear model parameter is drawn from for each feature if |
logksyn_sd |
Standard deviation of normal distribution from which reference log(ksyn)
linear model parameter is drawn from for each feature if |
logkdeg_diff_avg |
Mean of normal distribution from which non-reference log(kdeg)
linear model parameters are drawn from for each feature if |
logksyn_diff_avg |
Mean of normal distribution from which reference log(ksyn)
linear model parameter are drawn from for each feature if |
logkdeg_diff_sd |
Standard deviation of normal distribution from which reference log(kdeg)
linear model parameter are drawn from for each feature if |
logksyn_diff_sd |
Standard deviation of normal distribution from which reference log(ksyn)
linear model parameter are drawn from for each feature if |
pdiff_kd |
Proportion of features for which non-reference log(kdeg) linear model parameters differ from the reference. |
pdiff_ks |
Proportion of features for which non-reference log(ksyn) linear model parameters differ from the reference. |
pdiff_both |
Proportion of features for which BOTH non-reference log(kdeg) and log(ksyn) linear model parameters differ from the reference. |
pdo |
Dropout rate; think of this as the probability that a s4U containing
molecule is lost during library preparation and sequencing. If |
dynamics_preset |
Which preset model to use for simulation of dynamics.
Therefore, only relevant if
|
unassigned_name |
String to give to reads not assigned to a given feature. |
dispersion |
Negative binomial |
lfn_sd |
Logit(fn) replicate variability. |
treatment_effects |
Data frame describing effects of treatment on each
parameter. Should have five columns: "parameter_index", "treatment_index", "mean", "sd",
and "fraction_affected".
Each row corresponds to the effect the ith (i = treatment_index) treatment has on the
jth (j = parameter_index) kinetic parameter. Effect sizes, on a log-scale, are drawn from
a Normal distribution with mean and standard deviation set by the mean and sd columns,
respectively. The number of non-zero effects is set by "fraction_affected", and is
equal to |
effect_avg_default |
If |
effect_sd_default |
If |
fraction_affected_default |
If |
log_means |
Vector of log-Normal logmeans from which the distribution of
feature-specific parameters will be drawn from. Length of vector should be the same
as max(entries in |
log_sds |
Vector of log-Normal logsds from which the distribution of feature-specific parameters will be drawn from. If not provided, will be 0.4 for all parameters. |
A list containing 5 elements:
cB: Tibble that can be provided as the cB arg to EZbakRData().
metadf: Tibble that can be provided as the metadf arg to EZbakRData().
PerRepTruth: Tibble containing replicate-by-replicate simulated ground truth
AvgTruth: Tibble containing average simulated ground truth
param_details: Tibble containing information about simulated linear model parameters
# Simulate standard data simdata_standard <- EZSimulate(30) # Simulate dynamical systems data simdata_ode <- EZSimulate(30, mode = "dynamics", ntreatments = 1, label_time = c(1, 3), dynamics_preset = "nuc2cyto")# Simulate standard data simdata_standard <- EZSimulate(30) # Simulate dynamical systems data simdata_ode <- EZSimulate(30, mode = "dynamics", ntreatments = 1, label_time = c(1, 3), dynamics_preset = "nuc2cyto")
Make a plot of effect size (x-axis) vs. multiple-test adjusted p-value (y-axis), coloring points by position relative to user-defined decision cutoffs.
EZVolcanoPlot( obj, parameter = "log_kdeg", design_factor = NULL, reference = NULL, experimental = NULL, param_name = NULL, param_function = NULL, features = NULL, condition = NULL, normalize_by_median = NULL, repeatID = NULL, exactMatch = TRUE, plotlog2 = TRUE, FDR_cutoff = 0.05, difference_cutoff = log(2), size = NULL, features_to_highlight = NULL, highlight_shape = 21, highlight_size_diff = 1, highlight_stroke = 0.7, highlight_fill = NA, highlight_color = "black" )EZVolcanoPlot( obj, parameter = "log_kdeg", design_factor = NULL, reference = NULL, experimental = NULL, param_name = NULL, param_function = NULL, features = NULL, condition = NULL, normalize_by_median = NULL, repeatID = NULL, exactMatch = TRUE, plotlog2 = TRUE, FDR_cutoff = 0.05, difference_cutoff = log(2), size = NULL, features_to_highlight = NULL, highlight_shape = 21, highlight_size_diff = 1, highlight_stroke = 0.7, highlight_fill = NA, highlight_color = "black" )
obj |
An object of class |
parameter |
Name of parameter whose comparison you want to plot. |
design_factor |
Name of factor from |
reference |
Name of reference |
experimental |
Name of |
param_name |
If you want to assess the significance of a single parameter, rather than the comparison of two parameters, specify that one parameter's name here. |
param_function |
NOT YET IMPLEMENTED. Will allow you to specify more complicated functions of parameters when hypotheses you need to test are combinations of parameters rather than individual parameters or simple differences in two parameters. |
features |
Character vector of feature names for which comparisons were made. |
condition |
Defunct parameter that has been replaced with |
normalize_by_median |
Whether or not the median was subtracted from the estimated parameter differences. |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
plotlog2 |
If TRUE, assume that log(parameter) difference is passed in and that you want to plot log2(parameter) difference. TO-DO: probably best to change this to a more general scale parameter by which the parameter is multiplied. Default would be log2(exp(1)) to convert log() to log2(). |
FDR_cutoff |
False discovery cutoff by which to color points. |
difference_cutoff |
Minimum absolute difference cutoff by which to color points. |
size |
Size of points, passed to |
features_to_highlight |
Features you want to highlight in the plot (black circle will be drawn around them). This can either be a data frame with one column per feature type in the comparison table you are visualizing, or a vector of feature names if the relevant comparison table will only have one feature type noted. |
highlight_shape |
Shape of points overlayed on highlighted features. Defaults to an open circle |
highlight_size_diff |
Sets how much larger should the points overlayed on the highlighted features be than the original plot points. |
highlight_stroke |
Stroke width of the points overlayed on the highlighted features. |
highlight_fill |
Fill color of the points overlayed on the highlighted
features. Default is for them to be fill-less ( |
highlight_color |
Stroke color of the points overlayed on the highlighted points. |
EZVolcanoPlot() accepts as input the output of CompareParameters(), i.e.,
an EZbakRData object with at least one "comparisons" table. It will plot
the "difference" column in this table versus -log10 of the "padj" column.
In the simplest case, "difference" represents a log-fold change in a kinetic
parameter (e.g., kdeg) estimate. More complicated linear model fits and
comparisons can yield different parameter estimates.
NOTE: some outputs of CompareParameters() are not meant for visualization
via a volcano plot. For example, when fitting certain interaction models,
some of the parameter estimates may represent average log(kinetic paramter)
in one condition. See discussion of one example of this here.
EZbakR estimates kinetic parameters in EstimateKinetics() and EZDynamics()
on a log-scale. By default, since log2-fold changes are a bit easier to interpret
and more common for these kind of visualizations, EZVolcanoPlot() multiplies
the x-axis value by log2(exp(1)), which is the factor required to convert from
a log to a log2 scale. You can turn this off by setting plotlog2 to FALSE.
A ggplot2 object. X-axis = log2(estimate of interest (e.g., fold-change
in degradation rate constant); Y-axis = -log10(multiple test adjusted p-value);
points colored by location relative to FDR and effect size cutoffs.
# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo) # Average estimates across replicate ezbdo <- AverageAndRegularize(ezbdo) # Compare parameters across conditions ezbdo <- CompareParameters( ezbdo, design_factor = "treatment", reference = "treatment1", experimental = "treatment2" ) # Make volcano plot (ggplot object that you can save and add/modify layers) EZVolcanoPlot(ezbdo)# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Estimate Kinetics ezbdo <- EstimateKinetics(ezbdo) # Average estimates across replicate ezbdo <- AverageAndRegularize(ezbdo) # Compare parameters across conditions ezbdo <- CompareParameters( ezbdo, design_factor = "treatment", reference = "treatment1", experimental = "treatment2" ) # Make volcano plot (ggplot object that you can save and add/modify layers) EZVolcanoPlot(ezbdo)
EZbakRFractions object.Uses TMM normalization strategy, similar to that used by DESeq2 and edgeR.
get_normalized_read_counts( obj, features_to_analyze, fractions_name = NULL, feature_lengths = NULL, scale_factors = NULL ) ## S3 method for class 'EZbakRFractions' get_normalized_read_counts( obj, features_to_analyze, fractions_name = NULL, feature_lengths = NULL, scale_factors = NULL ) ## S3 method for class 'EZbakRData' get_normalized_read_counts( obj, features_to_analyze, fractions_name = NULL, feature_lengths = NULL, scale_factors = NULL )get_normalized_read_counts( obj, features_to_analyze, fractions_name = NULL, feature_lengths = NULL, scale_factors = NULL ) ## S3 method for class 'EZbakRFractions' get_normalized_read_counts( obj, features_to_analyze, fractions_name = NULL, feature_lengths = NULL, scale_factors = NULL ) ## S3 method for class 'EZbakRData' get_normalized_read_counts( obj, features_to_analyze, fractions_name = NULL, feature_lengths = NULL, scale_factors = NULL )
obj |
An |
features_to_analyze |
Features in relevant table |
fractions_name |
Name of fractions table to use |
feature_lengths |
Table of effective lengths for each feature combination in your data. For example, if your analysis includes features named GF and XF, this should be a data frame with columns GF, XF, and length. |
scale_factors |
Dataframe with two columns, one being "sample" (sample names) and the other being "scale_factor" (value to divide read counts by to normalize them) |
Data table of normalized read counts.
get_normalized_read_counts(EZbakRFractions): Method for class EZbakRFractions
Get normalized read counts from fractions table.
get_normalized_read_counts(EZbakRData): Method for class EZbakRData
Get normalized read counts from a cB table.
# Simulate data simdata <- EZSimulate(30) # Create EZbakRData object ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Get normalized read counts reads <- get_normalized_read_counts(ezbdo, features_to_analyze = "feature")# Simulate data simdata <- EZSimulate(30) # Create EZbakRData object ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Get normalized read counts reads <- get_normalized_read_counts(ezbdo, features_to_analyze = "feature")
A convenient wrapper to tximport() for importing isoform quantification
data into an EZbakRData object. You need to run this before running
EstimateIsoformFractions.
ImportIsoformQuant( obj, files, quant_tool = c("none", "salmon", "sailfish", "alevin", "piscem", "kallisto", "rsem", "stringtie"), txIn = TRUE, ... )ImportIsoformQuant( obj, files, quant_tool = c("none", "salmon", "sailfish", "alevin", "piscem", "kallisto", "rsem", "stringtie"), txIn = TRUE, ... )
obj |
An |
files |
A named vector of paths to all transcript quantification files that you would like
to import. This will be passed as the first argument of |
quant_tool |
String denoting the type of software used to generate the abundances. Will
get passed to the |
txIn |
Whether or now you are providing isoform level quantification files.
Alternative ( |
... |
Additional arguments to be passed to |
An EZbakRData object with an additional element in the readcounts
list named "isform_quant_<quant_tool>". It contains TPM, expected_count,
and effective length information for each transcript_id and each sample.
# Dependencies for example library(dplyr) library(data.table) # Simulate and analyze data simdata <- EZSimulate(30) ezbdo <- EZbakRData(simdata$cB, simdata$metadf) ezbdo <- EstimateFractions(ezbdo) # Hack to generate example quantification files savedir <- tempdir() rsem_data <- tibble( transcript_id = paste0("tscript_feature", 1:30), gene_id = paste0("feature", 1:30), length = 1000, effective_length = 1000, expected_count = 1000, TPM = 10, FPKM = 10, IsoPct = 1 ) fwrite(rsem_data, file.path(savedir, "Sample_1.isoforms.results"), sep = '\t') files <- file.path(savedir,"Sample_1.isoforms.results") names(files) <- "Sample_1" # Read in file ezbdo <- ImportIsoformQuant(ezbdo, files, quant_tool = "rsem")# Dependencies for example library(dplyr) library(data.table) # Simulate and analyze data simdata <- EZSimulate(30) ezbdo <- EZbakRData(simdata$cB, simdata$metadf) ezbdo <- EstimateFractions(ezbdo) # Hack to generate example quantification files savedir <- tempdir() rsem_data <- tibble( transcript_id = paste0("tscript_feature", 1:30), gene_id = paste0("feature", 1:30), length = 1000, effective_length = 1000, expected_count = 1000, TPM = 10, FPKM = 10, IsoPct = 1 ) fwrite(rsem_data, file.path(savedir, "Sample_1.isoforms.results"), sep = '\t') files <- file.path(savedir,"Sample_1.isoforms.results") names(files) <- "Sample_1" # Read in file ezbdo <- ImportIsoformQuant(ezbdo, files, quant_tool = "rsem")
EZbakRArrowData object constructor for internal usenew_EZbakRArrowData efficiently creates an object of class EZbakRArrowData.
It does not perform any rigorous checks of the legitimacy of this object.
new_EZbakRArrowData(cBds, metadf)new_EZbakRArrowData(cBds, metadf)
cBds |
Arrow dataset tracking the sample ID, mutational and nucleotide content, and feature assignment of sequencing reads. |
metadf |
Data frame tracking features of each of the samples included
in |
EZbakRDataobject constructor for internal usenew_EZbakRData efficiently creates an object of class EZbakRData.
It does not perform any rigorous checks of the legitimacy of this object.
new_EZbakRData(cB, metadf)new_EZbakRData(cB, metadf)
cB |
Data frame tracking the sample ID, mutational and nucleotide content, and feature assignment of sequencing reads. |
metadf |
Data frame tracking features of each of the samples included
in |
EZbakRFractions object constructornew_EZbakRFractions efficiently creates an object of class EZbakRFractions.
It does not perform any rigorous checks of the legitimacy of this object.
new_EZbakRFractions(fractions, metadf, name = NULL, character_limit = 20)new_EZbakRFractions(fractions, metadf, name = NULL, character_limit = 20)
fractions |
Data frame containing information about the fraction of reads from each mutational population of interest. |
metadf |
Data frame reporting aspects of each of the samples included |
name |
Optional; name to give to fractions table. |
character_limit |
Maximum number of characters for naming out fractions output. EZbakR
will try to name this as a "_" separated character vector of all of the features analyzed.
If this name is greater than |
EZbakRKinetics object constructornew_EZbakRKinetics efficiently creates an object of class EZbakRKinetics.
It does not perform any rigorous checks of the legitimacy of this object.
new_EZbakRKinetics( kinetics, features, metadf, name = NULL, character_limit = 20 )new_EZbakRKinetics( kinetics, features, metadf, name = NULL, character_limit = 20 )
kinetics |
Data frame containing information about the kinetic parameters of interest for each set of features tracked. |
features |
Features tracked in |
metadf |
Data frame describing each of the samples included |
name |
Optional; name to give to fractions table. |
character_limit |
Maximum number of characters for naming out fractions output. EZbakR
will try to name this as a "_" separated character vector of all of the features analyzed.
If this name is greater than |
Uses the strategy described here, and similar to that originally presented in Berg et al. 2024, to normalize for dropout. Normalizing for dropout means identifying a reference sample with low dropout, and estimating dropout in each sample relative to that sample.
NormalizeForDropout( obj, normalize_across_tls = FALSE, grouping_factors = NULL, features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, read_cutoff = 25 )NormalizeForDropout( obj, normalize_across_tls = FALSE, grouping_factors = NULL, features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, read_cutoff = 25 )
obj |
An EZbakRFractions object, which is an EZbakRData object on which
you have run |
normalize_across_tls |
If TRUE, samples from different label times will be normalized by finding a max inferred degradation rate constant (kdeg) sample and using that as a reference. Degradation kinetics with this max will be assumed to infer reference fraction news at different label times |
grouping_factors |
Which sample-detail columns in the metadf should be used
to group -s4U samples by for calculating the average -s4U RPM? The default value of
|
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
read_cutoff |
Minimum number of reads for a feature to be used to fit dropout model. |
NormalizeForDropout() has a number of unique advantages relative to
CorrectDropout():
NormalizeForDropout() doesn't require -label control data.
NormalizeForDropout() compares an internally normalized quantity
(fraction new) across samples, which has some advantages over the
absolute dropout estimates derived from comparisons of normalized read
counts in CorrectDropout().
NormalizeForDropout() may be used to normalize half-life estimates
across very different biological contexts (e.g., different cell types).
There are also some caveats to be aware of when using NormalizeForDropout():
Be careful using NormalizeForDropout() when you have multiple different
label times. Dropout normalization requires each sample be compared to a reference
sample with the same label time. Thus, normalization will be performed
separately for groups of samples with different label times. If the extent
of dropout in the references with different label times is different, there
will still be unaccounted for dropout biases between some of the samples.
NormalizeForDropout() effectively assumes that there are no true global
differences in turnover kinetics of RNA. If such differences actually exist
(e.g., half-lives in one context are on average truly lower than those in
another), NormalizeForDropout() risks normalizing away these real
differences. This is similar to how statistical normalization strategies
implemented in differential expression analysis software like DESeq2 assumes
that there are no global differences in RNA levels.
By default, all samples with same label time are normalized with respect
to a reference sample chosen from among them. If you want to further separate
the groups of samples that are normalized together, specify the columns of
your metadf by which you want to additionally group factors in the grouping_factors
parameter. This behavior can be changed by setting normalize_across_tls to
TRUE, which will
An EZbakRData object with the specified "fractions" table replaced
with a dropout corrected table.
# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Normalize for dropout ezbdo <- NormalizeForDropout(ezbdo)# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Normalize for dropout ezbdo <- NormalizeForDropout(ezbdo)
A list of example "graphs" and specie formulas that can be passed to
EZDynamics or SimulateDynamics, and that are used under the hood
by EZSimulate to facilitate simulations of ODE models.
ode_modelsode_models
ode_modelsA list with 6 elements
Simplest model of nuclear and cytoplasmic RNA dynamics: 0 -> N -> C -> 0
Simplest model of pre-RNA and mature RNA dynamics: 0 -> P -> M -> 0
Same as preRNA, but now pre-RNA can also degrade.
Same as nuc2cyto, but now nuclear RNA can also degrade.
Subcellular TimeLapse-seq model, similar to that described in Ietswaart et al., 2024. Simplest model discussed there, lacking nuclear degradation: 0 -> CH -> NP -> CY -> PL -> 0, and CY can also degrade.
Combination of nuc2cyto and preRNA where preRNA is first synthesized, then either processed or exported to the cytoplasm. Processing can also occur in the cytoplasm, and mature nuclear RNA can be exported to the cytoplasm. Only mature RNA degrades.
Each element of list has two items
Matrix representation of ODE system graph.
Formula objects relating measured species to modeled species.
Ietswaart et al. (2024) Molecular Cell. 84(14), 2765-2784.
EZbakRArrowData objectsPrint method for EZbakRArrowData objects
## S3 method for class 'EZbakRArrowData' print(x, max_name_chars = 60, ...)## S3 method for class 'EZbakRArrowData' print(x, max_name_chars = 60, ...)
x |
An |
max_name_chars |
Maximum number of characters to print on each line |
... |
Ignored |
The input EZbakRData object, invisibly
# Simulate data to analyze simdata <- SimulateOneRep(30) # Create EZbakR input metadf <- data.frame(sample = "sampleA", tl = 2) ezbdo <- EZbakRData(simdata$cB, metadf) # Print print(ezbdo)# Simulate data to analyze simdata <- SimulateOneRep(30) # Create EZbakR input metadf <- data.frame(sample = "sampleA", tl = 2) ezbdo <- EZbakRData(simdata$cB, metadf) # Print print(ezbdo)
EZbakRData objectsPrint method for EZbakRData objects
## S3 method for class 'EZbakRData' print(x, max_name_chars = 60, ...)## S3 method for class 'EZbakRData' print(x, max_name_chars = 60, ...)
x |
An |
max_name_chars |
Maximum number of characters to print on each line |
... |
Ignored |
The input EZbakRData object, invisibly
# Simulate data to analyze simdata <- SimulateOneRep(30) # Create EZbakR input metadf <- data.frame(sample = "sampleA", tl = 2) ezbdo <- EZbakRData(simdata$cB, metadf) # Print print(ezbdo)# Simulate data to analyze simdata <- SimulateOneRep(30) # Create EZbakR input metadf <- data.frame(sample = "sampleA", tl = 2) ezbdo <- EZbakRData(simdata$cB, metadf) # Print print(ezbdo)
Simple simulation function
SimpleSim( nreads = 1000, fn = 0.5, pnew = 0.05, pold = 0.001, rlen = 100, Ucont = 0.25 )SimpleSim( nreads = 1000, fn = 0.5, pnew = 0.05, pold = 0.001, rlen = 100, Ucont = 0.25 )
nreads |
Number of reads to simulate |
fn |
Fraction of reads that are new in simulation. Whether a read will be new will be determined by a draw from a Bernoulli(fn) distribution. |
pnew |
T-to-C mutation rate in new reads |
pold |
T-to-C mutation rate in old reads |
rlen |
Length of simulated reads |
Ucont |
Fraction of nucleotides in simulated reads that are Ts (U in RNA) |
Tibble with 3 columns:
nT: Simulated number of Ts
TC: Simulated number of T-to-C mutations
n: Number of simulated reads with nT Ts and TC mutations.
# Simulate 1 gene worth of data data simdata <- SimpleSim()# Simulate 1 gene worth of data data simdata <- SimpleSim()
SimulateDynamics() simulates any specified dynamical system of interconverting
RNA species. Its required input is similar to that of EstimateDynamics(), i.e.,
an adjacency matrix describing the set of species and how they are related to
one another and a list of formula relating actually assayed species to the
modeled species. Currently, SimulateDynamics() implements a naive heteroskedastic
replicate variability simulation and is not designed to simulate multiple experimental
conditions.
SimulateDynamics( nfeatures, graph, metadf, log_means, log_sds, ntreatments = 1, treatment_effects = NULL, formula_list = NULL, unassigned_name = "__no_feature", seqdepth = nfeatures * 2500, dispersion = 100, lfn_sd = 0.2, effect_avg_default = 0, effect_sd_default = 0.5, fraction_affected_default = 0.5, ... )SimulateDynamics( nfeatures, graph, metadf, log_means, log_sds, ntreatments = 1, treatment_effects = NULL, formula_list = NULL, unassigned_name = "__no_feature", seqdepth = nfeatures * 2500, dispersion = 100, lfn_sd = 0.2, effect_avg_default = 0, effect_sd_default = 0.5, fraction_affected_default = 0.5, ... )
nfeatures |
Number of "features" to simulate data for. A "feature" in this case may contain a number of "sub-features". For example, you may want to simulate pre-RNA and mature RNA for a set of "genes", in which case the number of features is the number of genes. |
graph |
An adjacency matrix describing the reaction diagram graph relating the various RNA species to one another. |
metadf |
Data frame with two required columns ( |
log_means |
Vector of log-Normal logmeans from which the distribution of
feature-specific parameters will be drawn from. Length of vector should be the same
as max(entries in |
log_sds |
Vector of log-Normal logsds from which the distribution of feature-specific parameters will be drawn from. |
ntreatments |
Number of distinct experimental treatments to simulate. By default, only a single "treatment" (you might refer to this as wild-type, or control) is simulated. Increase this if you would like to explore performing comparative dynamical systems modeling. |
treatment_effects |
Data frame describing effects of treatment on each
parameter. Should have five columns: "parameter_index", "treatment_index", "mean", "sd",
and "fraction_affected".
Each row corresponds to the effect the ith (i = treatment_index) treatment has on the
jth (j = parameter_index) kinetic parameter. Effect sizes, on a log-scale, are drawn from
a Normal distribution with mean and standard deviation set by the mean and sd columns,
respectively. The number of non-zero effects is set by "fraction_affected", and is
equal to |
formula_list |
A list of named lists. The names of each sub-list should be
the same as the sample names as they are found in |
unassigned_name |
String to give to reads not assigned to a given feature. |
seqdepth |
Total number of reads in each sample. |
dispersion |
Negative binomial |
lfn_sd |
Logit(fn) replicate variability. |
effect_avg_default |
If |
effect_sd_default |
If |
fraction_affected_default |
If |
... |
Parameters passed to |
SimulateIsoforms() performs a simple simulation of isoform-specific kinetic
parameters to showcase and test EstimateIsoformFractions(). It assumes that
there are a set of reads (fraction of total set by funique parameter) which
map uniquely to a given isoform, while the rest are ambiguous to all isoforms
from that gene. Mutational content of these reads are simulated as in
SimulateOneRep().
SimulateIsoforms( nfeatures, nt = NULL, seqdepth = nfeatures * 2500, label_time = 4, sample_name = "sampleA", feature_prefix = "Gene", pnew = 0.1, pold = 0.002, funique = 0.2, readlength = 200, Ucont = 0.25, avg_numiso = 2, psynthdiff = 0.5, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7 )SimulateIsoforms( nfeatures, nt = NULL, seqdepth = nfeatures * 2500, label_time = 4, sample_name = "sampleA", feature_prefix = "Gene", pnew = 0.1, pold = 0.002, funique = 0.2, readlength = 200, Ucont = 0.25, avg_numiso = 2, psynthdiff = 0.5, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7 )
nfeatures |
Number of "features" to simulate data for. Each feature will have a simulated number of transcript isoforms |
nt |
(Optional), can provide a vector of the number of isoforms you would
like to simulate for each of the |
seqdepth |
Total number of sequencing reads to simulate |
label_time |
Length of s^4^U feed to simulate. |
sample_name |
Character vector to assign to |
feature_prefix |
Name given to the i-th feature is |
pnew |
Probability that a T is mutated to a C if a read is new. |
pold |
Probability that a T is mutated to a C if a read is old. |
funique |
Fraction of reads that uniquely "map" to a single isoform. |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
Ucont |
Percentage of nucleotides simulated to be U's. |
avg_numiso |
Average number of isoforms for each feature. Feature-specific
isoform counts are drawn from a Poisson distribution with this average. NOTE:
to insure that all features have multiple isoforms, the simulated number of
isoforms drawn from a Poisson distribution is incremented by 2. Thus, the
actual average number of isoforms from each feature is |
psynthdiff |
Percentage of genes for which all isoform abundance differences are synthesis driven. If not synthesis driven, then isoform abundance differences will be driven by differences in isoform kdegs. |
logkdeg_mean |
meanlog of a log-normal distribution from which kdegs are simulated |
logkdeg_sd |
sdlog of a log-normal distribution from which kdegs are simulated |
logksyn_mean |
meanlog of a log-normal distribution from which ksyns are simulated |
logksyn_sd |
sdlog of a log-normal distribution from which ksyns are simulated |
List with two elements:
cB: Tibble that can be passed as the cB arg to EZbakRData().
ground_truth: Tibble containing simulated ground truth.
simdata <- SimulateIsoforms(30)simdata <- SimulateIsoforms(30)
SimulateMultiCondition is a highly flexibly simulator that combines linear modeling
of log(kdeg)'s and log(ksyn)'s with SimulateOneRep to simulate an NR-seq dataset. The linear model
allows you to simulate multiple distinct treatments, batch effects, interaction effects,
etc. The current downside for its flexibility is its relative complexity to implement.
Easier to use simulators are on the way to EZbakR.
SimulateMultiCondition( nfeatures, metadf, mean_formula, param_details = NULL, seqdepth = nfeatures * 2500, label_time = 2, pnew = 0.05, pold = 0.001, readlength = 200, Ucont_alpha = 25, Ucont_beta = 75, feature_prefix = "Gene", dispslope = 5, dispint = 0.01, logkdegsdtrend_slope = -0.3, logkdegsdtrend_intercept = -2.25, logksynsdtrend_slope = -0.3, logksynsdtrend_intercept = -2.25, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7, logkdeg_diff_avg = 0, logksyn_diff_avg = 0, logkdeg_diff_sd = 0.5, logksyn_diff_sd = 0.5, pdiff_kd = 0.1, pdiff_ks = 0, pdiff_both = 0, pdo = 0 )SimulateMultiCondition( nfeatures, metadf, mean_formula, param_details = NULL, seqdepth = nfeatures * 2500, label_time = 2, pnew = 0.05, pold = 0.001, readlength = 200, Ucont_alpha = 25, Ucont_beta = 75, feature_prefix = "Gene", dispslope = 5, dispint = 0.01, logkdegsdtrend_slope = -0.3, logkdegsdtrend_intercept = -2.25, logksynsdtrend_slope = -0.3, logksynsdtrend_intercept = -2.25, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7, logkdeg_diff_avg = 0, logksyn_diff_avg = 0, logkdeg_diff_sd = 0.5, logksyn_diff_sd = 0.5, pdiff_kd = 0.1, pdiff_ks = 0, pdiff_both = 0, pdo = 0 )
nfeatures |
Number of "features" (e.g., genes) to simulate data for |
metadf |
A data frame with the following columns:
These parameters (described more below) can also be included in metadf to specify sample-specific simulation parameter:
|
mean_formula |
A formula object that specifies the linear model used to
relate the factors in the |
param_details |
A data frame with one row for each column of the design matrix
obtained from
If param_details is not specified by the user, the first column of the design matrix
is assumed to represent the reference parameter, all parameters are assumed to be
non-global, logkdeg_mean and logksyn_mean are set to the equivalently named parameter values
described below for the reference and |
seqdepth |
Only relevant if |
label_time |
Length of s^4^U feed to simulate. |
pnew |
Probability that a T is mutated to a C if a read is new. |
pold |
Probability that a T is mutated to a C if a read is old. |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
Ucont_alpha |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape1 = |
Ucont_beta |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape2 = |
feature_prefix |
Name given to the i-th feature is |
dispslope |
Negative binomial dispersion parameter "slope" with respect to read counts. See DESeq2 paper for dispersion model used. |
dispint |
Negative binomial dispersion parameter "intercept" with respect to read counts. See DESeq2 paper for dispersion model used. |
logkdegsdtrend_slope |
Slope for log10(read count) vs. log(kdeg) replicate variability trend |
logkdegsdtrend_intercept |
Intercept for log10(read count) vs. log(kdeg) replicate variability trend |
logksynsdtrend_slope |
Slope for log10(read count) vs. log(ksyn) replicate variability trend |
logksynsdtrend_intercept |
Intercept for log10(read count) vs. log(ksyn) replicate variability trend |
logkdeg_mean |
Mean of normal distribution from which reference log(kdeg)
linear model parameter is drawn from for each feature if |
logkdeg_sd |
Standard deviation of normal distribution from which reference log(kdeg)
linear model parameter is drawn from for each feature if |
logksyn_mean |
Mean of normal distribution from which reference log(ksyn)
linear model parameter is drawn from for each feature if |
logksyn_sd |
Standard deviation of normal distribution from which reference log(ksyn)
linear model parameter is drawn from for each feature if |
logkdeg_diff_avg |
Mean of normal distribution from which non-reference log(kdeg)
linear model parameters are drawn from for each feature if |
logksyn_diff_avg |
Mean of normal distribution from which reference log(ksyn)
linear model parameter are drawn from for each feature if |
logkdeg_diff_sd |
Standard deviation of normal distribution from which reference log(kdeg)
linear model parameter are drawn from for each feature if |
logksyn_diff_sd |
Standard deviation of normal distribution from which reference log(ksyn)
linear model parameter are drawn from for each feature if |
pdiff_kd |
Proportion of features for which non-reference log(kdeg) linear model parameters differ from the reference. |
pdiff_ks |
Proportion of features for which non-reference log(ksyn) linear model parameters differ from the reference. |
pdiff_both |
Proportion of features for which BOTH non-reference log(kdeg) and log(ksyn) linear model parameters differ from the reference. ksyns are simulated |
pdo |
Dropout rate; think of this as the probability that a s4U containing
molecule is lost during library preparation and sequencing. If |
A list containing 6 elements:
cB: Tibble that can be provided as the cB arg to EZbakRData().
metadf: Tibble that can be provided as the metadf arg to EZbakRData().
PerRepTruth: Tibble containing replicate-by-replicate simulated ground truth
AvgTruth: Tibble containing average simulated ground truth
param_details: Tibble containing information about simulated linear model parameters
UnbiasedFractions: Tibble containing no dropout ground truth
simdata <- SimulateMultiCondition(30, data.frame(sample = c('sampleA', 'sampleB'), treatment = c('treatment1', 'treatment2')), mean_formula = ~treatment-1)simdata <- SimulateMultiCondition(30, data.frame(sample = c('sampleA', 'sampleB'), treatment = c('treatment1', 'treatment2')), mean_formula = ~treatment-1)
Generalizes SimulateOneRep() to simulate any combination of mutation types. Currently, no kinetic model is used to relate certain parameters to the fractions of reads belonging to each simulated mutational population. Instead these fractions are drawn from a Dirichlet distribution with gene-specific parameters.
SimulateMultiLabel( nfeatures, populations = c("TC"), fraction_design = create_fraction_design(populations), fractions_matrix = NULL, read_vect = NULL, sample_name = "sampleA", feature_prefix = "Gene", kdeg_vect = NULL, ksyn_vect = NULL, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7, phighs = stats::setNames(rep(0.05, times = length(populations)), populations), plows = stats::setNames(rep(0.002, times = length(populations)), populations), seqdepth = nfeatures * 2500, readlength = 200, alpha_min = 3, alpha_max = 6, Ucont = 0.25, Acont = 0.25, Gcont = 0.25, Ccont = 0.25 )SimulateMultiLabel( nfeatures, populations = c("TC"), fraction_design = create_fraction_design(populations), fractions_matrix = NULL, read_vect = NULL, sample_name = "sampleA", feature_prefix = "Gene", kdeg_vect = NULL, ksyn_vect = NULL, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7, phighs = stats::setNames(rep(0.05, times = length(populations)), populations), plows = stats::setNames(rep(0.002, times = length(populations)), populations), seqdepth = nfeatures * 2500, readlength = 200, alpha_min = 3, alpha_max = 6, Ucont = 0.25, Acont = 0.25, Gcont = 0.25, Ccont = 0.25 )
nfeatures |
Number of "features" (e.g., genes) to simulate data for |
populations |
Vector of mutation populations you want to simulate. |
fraction_design |
Fraction design matrix, specifying which potential mutational populations should actually exist. See ?EstimateFractions for more details. |
fractions_matrix |
Matrix of fractions of each mutational population to simulate. If not provided, this will be simulated. One row for each feature, one column for each mutational population, rows should sum to 1. |
read_vect |
Vector of length = |
sample_name |
Character vector to assign to |
feature_prefix |
Name given to the i-th feature is |
kdeg_vect |
Vector of length = |
ksyn_vect |
Vector of length = |
logkdeg_mean |
If necessary, meanlog of a log-normal distribution from which kdegs are simulated |
logkdeg_sd |
If necessary, sdlog of a log-normal distribution from which kdegs are simulated |
logksyn_mean |
If necessary, meanlog of a log-normal distribution from which ksyns are simulated |
logksyn_sd |
If necessary, sdlog of a log-normal distribution from which ksyns are simulated |
phighs |
Vector of probabilities of mutation rates in labeled reads of each type denoted in
|
plows |
Vector of probabilities of mutation rates in unlabeled reads of each type denoted in
|
seqdepth |
Only relevant if |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
alpha_min |
Minimum possible value of alpha element of Dirichlet random variable |
alpha_max |
Maximum possible value of alpha element of Dirichlet random variable |
Ucont |
Probability that a nucleotide in a simulated read is a U. |
Acont |
Probability that a nucleotide in a simulated read is an A. |
Gcont |
Probability that a nucleotide in a simulated read is a G. |
Ccont |
Probability that a nucleotide in a simulated read is a C. |
List with two elements:
cB: Tibble that can be passed as the cB arg to EZbakRData().
ground_truth: Tibble containing simulated ground truth.
simdata <- SimulateMultiLabel(3)simdata <- SimulateMultiLabel(3)
In SimulateOneRep, users have the option to either provide vectors of feature-specific
read counts, fraction news, kdegs, and ksyns for the simulation, or to have those drawn
from relevant distributions whose properties can be tuned by the various optional
parameters of SimulateOneRep. The number of mutable nucleotides (nT) in
a read is drawn from a binomial distribution with readlength trials and a probability
of "success" equal to Ucont. A read's status as new or old is drawn from a Bernoulli
distribution with probability of "success" equal to the feature's fraction new. If a read
is new, the number of mutations in the read is drawn from a binomial distribution with
probability of mutation equal to pnew. If a read is old, the number of mutations is instead
drawn from a binomial distribution with probability of mutation equal to pold.
SimulateOneRep( nfeatures, read_vect = NULL, label_time = 2, sample_name = "sampleA", feature_prefix = "Gene", fn_vect = NULL, kdeg_vect = NULL, ksyn_vect = NULL, pnew = 0.05, pold = 0.002, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7, seqdepth = nfeatures * 2500, readlength = 200, Ucont_alpha = 25, Ucont_beta = 75, feature_pnew = FALSE, pnew_kdeg_corr = FALSE, logit_pnew_mean = -2.5, logit_pnew_sd = 0.1 )SimulateOneRep( nfeatures, read_vect = NULL, label_time = 2, sample_name = "sampleA", feature_prefix = "Gene", fn_vect = NULL, kdeg_vect = NULL, ksyn_vect = NULL, pnew = 0.05, pold = 0.002, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7, seqdepth = nfeatures * 2500, readlength = 200, Ucont_alpha = 25, Ucont_beta = 75, feature_pnew = FALSE, pnew_kdeg_corr = FALSE, logit_pnew_mean = -2.5, logit_pnew_sd = 0.1 )
nfeatures |
Number of "features" (e.g., genes) to simulate data for |
read_vect |
Vector of length = |
label_time |
Length of s^4^U feed to simulate. |
sample_name |
Character vector to assign to |
feature_prefix |
Name given to the i-th feature is |
fn_vect |
Vector of length = |
kdeg_vect |
Vector of length = |
ksyn_vect |
Vector of length = |
pnew |
Probability that a T is mutated to a C if a read is new. |
pold |
Probability that a T is mutated to a C if a read is old. |
logkdeg_mean |
If necessary, meanlog of a log-normal distribution from which kdegs are simulated |
logkdeg_sd |
If necessary, sdlog of a log-normal distribution from which kdegs are simulated |
logksyn_mean |
If necessary, meanlog of a log-normal distribution from which ksyns are simulated |
logksyn_sd |
If necessary, sdlog of a log-normal distribution from which ksyns are simulated |
seqdepth |
Only relevant if |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
Ucont_alpha |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape1 = |
Ucont_beta |
Probability that a nucleotide in a simulated read from a given feature
is a U is drawn from a beta distribution with shape2 = |
feature_pnew |
Boolean; if TRUE, simulate a different pnew for each feature |
pnew_kdeg_corr |
Boolean; only relevant if |
logit_pnew_mean |
If |
logit_pnew_sd |
If |
List with two elements:
cB: Tibble that can be passed as the cB arg to EZbakRData().
ground_truth: Tibble containing simulated ground truth.
simdata <- SimulateOneRep(30)simdata <- SimulateOneRep(30)
fraction_design table for EstimateFractions
An example fraction_design table for a standard NR-seq experiment with
s^4U labeling. This table tells EstimateFractions that there are two
populations of reads, one with high T-to-C mutation content and one
with low T-to-C mutation content
standard_fraction_designstandard_fraction_design
standard_fraction_designA tibble with 2 rows and 2 columns:
Boolean denoting if population represented by that row has high T-to-C mutational content
Boolean denoting if population represented by that row is expected to be present in this dataset
fraction_design table for EstimateFractions
An example fraction_design table for a TILAC experiment. TILAC was originally
described in Courvan et al., 2022. In this
method, two populations of RNA, one from s^4U fed cells and one from s^6G fed cells, are pooled
and prepped for sequencing together. This allows for internally controlled comparisons of RNA
abundance without spike-ins. s^4U is recoded to a cytosine analog by TimeLapse chemistry
(or similar chemistry) and s^6G is recoded to an adenine analog. Thus, fraction_design includes
columns called TC and GA. A unique aspect of the TILAC fraction_design table is that
one of the possible populations, TC and GA both TRUE, is denoted as not present (present = FALSE).
This is because there is no RNA was exposed to both s^4U and s^6G, thus a population of reads
with both high T-to-C and G-to-A mutational content should not exist.
tilac_fraction_designtilac_fraction_design
tilac_fraction_designA tibble with 4 rows and 3 columns:
Boolean denoting if population represented by that row has high T-to-C mutational content
Boolean denoting if population represented by that row has high G-to-A mutational content
Boolean denoting if population represented by that row is expected to be present in this dataset
EZbakRArrowData EZbakRArrowData validatorvalidate_EZbakRArrowData ensures that input for EZbakRArrowData object construction
is valid.
validate_EZbakRArrowData(obj)validate_EZbakRArrowData(obj)
obj |
An object of class |
EZbakRDataobject validatorvalidate_EZbakRData ensures that input for EZbakRData construction
is valid.
validate_EZbakRData(obj)validate_EZbakRData(obj)
obj |
An object of class |
EZbakRFractions object validatorvalidate_EZbakRFractions ensures that input for EZbakRFractions construction
is valid.
validate_EZbakRFractions(obj)validate_EZbakRFractions(obj)
obj |
An object of class |
EZbakRKinetics object validatorvalidate_EZbakRKinetics ensures that input for EZbakRKinetics construction
is valid.
validate_EZbakRKinetics(obj, features)validate_EZbakRKinetics(obj, features)
obj |
An object of class |
features |
Features tracked in |
Generalizes SimulateOneRep() to simulate any combination of mutation types. Currently, no kinetic model is used to relate certain parameters to the fractions of reads belonging to each simulated mutational population. Instead these fractions are drawn from a Dirichlet distribution with gene-specific parameters.
VectSimulateMultiLabel( nfeatures, populations = c("TC"), fraction_design = create_fraction_design(populations), fractions_matrix = NULL, read_vect = NULL, sample_name = "sampleA", feature_prefix = "Gene", kdeg_vect = NULL, ksyn_vect = NULL, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7, phighs = stats::setNames(rep(0.05, times = length(populations)), populations), plows = stats::setNames(rep(0.002, times = length(populations)), populations), seqdepth = nfeatures * 2500, readlength = 200, alpha_min = 3, alpha_max = 6, Ucont = 0.25, Acont = 0.25, Gcont = 0.25, Ccont = 0.25 )VectSimulateMultiLabel( nfeatures, populations = c("TC"), fraction_design = create_fraction_design(populations), fractions_matrix = NULL, read_vect = NULL, sample_name = "sampleA", feature_prefix = "Gene", kdeg_vect = NULL, ksyn_vect = NULL, logkdeg_mean = -1.9, logkdeg_sd = 0.7, logksyn_mean = 2.3, logksyn_sd = 0.7, phighs = stats::setNames(rep(0.05, times = length(populations)), populations), plows = stats::setNames(rep(0.002, times = length(populations)), populations), seqdepth = nfeatures * 2500, readlength = 200, alpha_min = 3, alpha_max = 6, Ucont = 0.25, Acont = 0.25, Gcont = 0.25, Ccont = 0.25 )
nfeatures |
Number of "features" (e.g., genes) to simulate data for |
populations |
Vector of mutation populations you want to simulate. |
fraction_design |
Fraction design matrix, specifying which potential mutational populations should actually exist. See ?EstimateFractions for more details. |
fractions_matrix |
Matrix of fractions of each mutational population to simulate. If not provided, this will be simulated. One row for each feature, one column for each mutational population, rows should sum to 1. |
read_vect |
Vector of length = |
sample_name |
Character vector to assign to |
feature_prefix |
Name given to the i-th feature is |
kdeg_vect |
Vector of length = |
ksyn_vect |
Vector of length = |
logkdeg_mean |
If necessary, meanlog of a log-normal distribution from which kdegs are simulated |
logkdeg_sd |
If necessary, sdlog of a log-normal distribution from which kdegs are simulated |
logksyn_mean |
If necessary, meanlog of a log-normal distribution from which ksyns are simulated |
logksyn_sd |
If necessary, sdlog of a log-normal distribution from which ksyns are simulated |
phighs |
Vector of probabilities of mutation rates in labeled reads of each type denoted in
|
plows |
Vector of probabilities of mutation rates in unlabeled reads of each type denoted in
|
seqdepth |
Only relevant if |
readlength |
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length. |
alpha_min |
Minimum possible value of alpha element of Dirichlet random variable |
alpha_max |
Maximum possible value of alpha element of Dirichlet random variable |
Ucont |
Probability that a nucleotide in a simulated read is a U. |
Acont |
Probability that a nucleotide in a simulated read is an A. |
Gcont |
Probability that a nucleotide in a simulated read is a G. |
Ccont |
Probability that a nucleotide in a simulated read is a C. |
List with two elements:
cB: Tibble that can be passed as the cB arg to EZbakRData().
ground_truth: Tibble containing simulated ground truth.
simdata <- VectSimulateMultiLabel(30)simdata <- VectSimulateMultiLabel(30)
Plots a measure of dropout (the ratio of -label to +label RPM) as a function of feature fraction new, with the model fit depicted. Use this function to qualitatively assess model fit and whether the modeling assumptions are met.
VisualizeDropout( obj, plot_type = c("grandR", "bakR"), grouping_factors = NULL, features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, n_min = 50, dropout_cutoff = 5, ... )VisualizeDropout( obj, plot_type = c("grandR", "bakR"), grouping_factors = NULL, features = NULL, populations = NULL, fraction_design = NULL, repeatID = NULL, exactMatch = TRUE, n_min = 50, dropout_cutoff = 5, ... )
obj |
An EZbakRFractions object, which is an EZbakRData object on which
you have run |
plot_type |
Which type of plot to make. Options are:
|
grouping_factors |
Which sample-detail columns in the metadf should be used
to group -s4U samples by for calculating the average -s4U RPM? The default value of
|
features |
Character vector of the set of features you want to stratify
reads by and estimate proportions of each RNA population. The default of |
populations |
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment. |
fraction_design |
"Design matrix" specifying which RNA populations exist
in your samples. By default, this will be created automatically and will assume
that all combinations of the |
repeatID |
If multiple |
exactMatch |
If TRUE, then |
n_min |
Minimum raw number of reads to make it to plot |
dropout_cutoff |
Maximum dropout value included in plot. |
... |
Parameters passed to internal |
A list of ggplot2 objects, one for each +label sample.
# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Visualize Dropout ezbdo <- VisualizeDropout(ezbdo)# Simulate data to analyze simdata <- EZSimulate(30) # Create EZbakR input ezbdo <- EZbakRData(simdata$cB, simdata$metadf) # Estimate Fractions ezbdo <- EstimateFractions(ezbdo) # Visualize Dropout ezbdo <- VisualizeDropout(ezbdo)