```
library(artemis)
```

(Adapted from a manuscript by Espe et al., in preparation.)

Simulating data is extremely useful. It's like a sandbox to play and practice in, allowing us to test assumptions, visualize model relationships, and conduct power analyses on hypothetical experiments.

The `artemis` package includes two flexible functions for simulating data: `sim_eDNA_lm()` and `sim_eDNA_lmer()`. These functions use the same data generation process outlined in the Getting Started with `artemis` vignette.

In this vignette, we will work through the three main steps of the data simulation process in `artemis`, and show how the results of data simulation can be stored, summarized, and plotted.

In order to simulate data, we first must provide a list of variables and their associated levels. For example, if the hypothetical survey or experiment had varying levels of the variables `distance`, `volume`, `tech_rep`, and `rep`, we might create the following complete list of variables:

```
vars = list(Cq = 1,
            Intercept = 1,
            distance = c(0, 15, 50),
            volume = c(25, 50),
            tech_rep = 1:10,
            rep = 1:3)
```

There is nothing special about the names used here, other than that they must match the terms eventually used in the formula passed to `sim_eDNA_lm()`. The variables can have as many levels as we like, and can be categorical or continuous (we would wrap categorical variables with `as.factor()`).

We included two dummy variables in our list: one for the response (`Cq`), and one for the intercept. The response dummy variable is used internally by the simulation functions to create the model matrix. We could have omitted the intercept variable here, but the `model.matrix()` function that gets called downstream will create one by default, so it's good practice to include it in our variable list.
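To see what this expansion looks like, here is a rough base-R sketch (illustrative, not the `artemis` internals): the variable list expands to a grid of every combination of levels, and `model.matrix()` builds the design matrix from the formula.

```r
# Sketch: expand the variable list into a complete grid, then build the
# model matrix for a formula (base R only; mirrors the vars list above).
vars = list(Cq = 1,
            Intercept = 1,
            distance = c(0, 15, 50),
            volume = c(25, 50),
            tech_rep = 1:10,
            rep = 1:3)

grid = expand.grid(vars)  # every combination of the levels: 3 * 2 * 10 * 3 rows
X = model.matrix(Cq ~ distance + volume, data = grid)

dim(X)   # 180 rows, one column per predictor plus the intercept
head(X)
```

Note that the resulting matrix has 180 rows, matching the `x` slot in the simulation object shown later in this vignette.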

Next, we must provide effect sizes (`betas`) for the variables we plan to include in the model formula. The values we assign the `betas` will no doubt depend on our reasons for doing the simulation. If we have previous model results for the effects of distance and volume, for example, we might use those for our `betas` vector:

```
betas = c(intercept = -10.6, distance = -0.05, volume = 0.02)
```

By default, the first value in `betas` will be considered the term for the intercept. The names of the `betas` don't need to match the names in the `vars` list above, or even the terms in the model formula, but the `sim_eDNA_*()` functions do reference the `betas` vector with respect to the model formula. This means that it's the *order* of the variables in the `betas` vector relative to the model formula that needs to match, and that there must be one beta for each predictor in the model formula.

When specifying the `betas` values, it helps to remember that all the model parameters are on the scale of \(log[eDNA]\), including the intercept. Here we specified the intercept to be -10.6, which came from previous modeling results but also corresponds to `exp(-10.6)` \([eDNA]\). This is a very small number, and thus a proxy for eDNA that is very rare in the environment before any covariates are considered. Often, if we're getting simulation results we find bizarre or surprising, the scale of the `betas` is the place to start debugging.
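A quick back-of-the-envelope check of the scale, using the `betas` above and a hypothetical sample (the covariate values here are invented for illustration):

```r
betas = c(intercept = -10.6, distance = -0.05, volume = 0.02)

# Expected log[eDNA] for one hypothetical sample: distance = 15, volume = 25
ln_conc = unname(betas["intercept"] + betas["distance"] * 15 + betas["volume"] * 25)
ln_conc      # -10.85

exp(-10.6)   # ~2.5e-05: eDNA is very rare before covariates are considered
```

Checking a few such hand computations against the simulation output is often the fastest way to catch a `betas` value on the wrong scale.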

With our variable list and `betas` decided, we're ready to simulate data using the `sim_eDNA_*()` functions! For a fixed-effects model, we use `sim_eDNA_lm()`. This function takes a model formula (with syntax identical to that used in the `lme4` package), the variable list, and the `betas` vector. It also requires the parameters of the standard curve equation used to convert between \(log[eDNA]\) and Cq values, as well as the measurement error (`sigma_ln_eDNA`):

```
set.seed(1234)
ans = sim_eDNA_lm(Cq ~ distance + volume,
                  variable_list = vars,
                  betas = betas,
                  sigma_ln_eDNA = 1.5,
                  std_curve_alpha = 21.2,
                  std_curve_beta = -1.5)
#> Chain 1:
#> Chain 1: Elapsed Time: 0 seconds (Warm-up)
#> Chain 1: 4.7e-05 seconds (Sampling)
#> Chain 1: 4.7e-05 seconds (Total)
#> Chain 1:
```
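The standard curve is a linear map from \(log[eDNA]\) to Cq. As a rough base-R sketch (illustrative, not the `artemis` internals), using the `std_curve_alpha` and `std_curve_beta` values passed above:

```r
# Hypothetical sketch of the standard curve conversion
std_curve_alpha = 21.2
std_curve_beta  = -1.5
upper_Cq        = 40

ln_conc = -10.85                             # an expected log[eDNA] value
Cq_hat  = std_curve_alpha + std_curve_beta * ln_conc
Cq_hat                                       # 37.475
Cq_star = min(Cq_hat, upper_Cq)              # truncated at the upper threshold
```

Note that a log concentration near -10.85 lands close to the Cq values in the high 30s seen in the simulated output, which is a useful sanity check on the curve parameters.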

The returned object from the `sim_eDNA_*()` functions is an object of class "eDNA_simulation", which is essentially a list containing: 1) the latent variable `ln_conc_hat`, 2) its value after measurement error is added, `ln_conc_star`, and 3) the estimated Cq value following the conversion and truncation, `Cq_star`:

```
str(ans, max.level = 3)
#> Formal class 'eDNA_simulation' [package "artemis"] with 10 slots
#> ..@ ln_conc_hat : num [1, 1:180] -11.04 -10.76 -10.61 -11.45 -8.91 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..@ ln_conc_star : num [1, 1:180] -10.1 -10.8 -12.6 -9.6 -10.3 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..@ Cq_star : num [1, 1:180] 37.8 37.3 37.1 38.4 34.6 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..@ formula :Class 'formula' language Cq ~ distance + volume
#> .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
#> ..@ variable_levels:List of 6
#> .. ..$ Cq : num 1
#> .. ..$ Intercept: num 1
#> .. ..$ distance : num [1:3] 0 15 50
#> .. ..$ volume : num [1:2] 25 50
#> .. ..$ tech_rep : int [1:10] 1 2 3 4 5 6 7 8 9 10
#> .. ..$ rep : int [1:3] 1 2 3
#> ..@ betas : Named num [1:3] -10.6 -0.05 0.02
#> .. ..- attr(*, "names")= chr [1:3] "intercept" "distance" "volume"
#> ..@ x :'data.frame': 180 obs. of 3 variables:
#> .. ..$ (Intercept): num [1:180] 1 1 1 1 1 1 1 1 1 1 ...
#> .. ..$ distance : num [1:180] 0 15 50 0 15 50 0 15 50 0 ...
#> .. ..$ volume : num [1:180] 25 25 25 50 50 50 25 25 25 50 ...
#> ..@ std_curve_alpha: num 21.2
#> ..@ std_curve_beta : num -1.5
#> ..@ upper_Cq : num 40
```

We can inspect empirical summaries of the marginal distributions in the simulations with `summary()`,

```
summary(ans)
#> variable level 2.5% 50% 97.5% mean p_detect
#> 1 (Intercept) 1 31.99562 37.60102 40 37.29959 0.7888889
#> 2 distance 0 31.44763 36.07375 40 35.92179 0.9333333
#> 3 distance 15 33.87386 37.39391 40 37.31752 0.8500000
#> 4 distance 50 34.86472 39.34097 40 38.65945 0.5833333
#> 5 volume 25 32.42318 37.44094 40 37.32440 0.8222222
#> 6 volume 50 31.74474 37.84910 40 37.27477 0.7555556
```

…which summarizes the response (Cq on its natural scale) across each level of the variables used to simulate the data.

**Warning**: The last column of the table returned by `summary()` is the proportion of observations that represented positive detections, i.e. a Cq value below the upper cycle threshold for non-detection (set to 40 by default). In the output above, 93% of the simulated data with `distance` = 0 fell below 40.0.
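The detection proportion can be reproduced by hand from simulated Cq values. A sketch with made-up numbers (not the output above):

```r
upper_Cq = 40
Cq_star  = c(36.1, 40, 38.2, 40, 33.5, 39.9)  # made-up Cq values for illustration

# A detection is any Cq below the upper cycle threshold
p_detect = mean(Cq_star < upper_Cq)
p_detect   # 0.667: four of the six samples are detections
```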

By default, the `sim_eDNA_*()` functions will only simulate one instance of the hypothetical survey/experiment, but we can easily simulate many datasets by adjusting the `n_sim` parameter. This becomes especially useful when we want to do power analyses on simulated results.

```
ans2 = sim_eDNA_lm(Cq ~ distance + volume,
                   variable_list = vars,
                   betas = betas,
                   sigma_ln_eDNA = 1,
                   std_curve_alpha = 21.2,
                   std_curve_beta = -1.5,
                   n_sim = 500) # specifies the number of simulated datasets
```

When multiple simulations are requested, each of these will be stored in a matrix, with each column representing a single simulated instance.
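One way to picture this layout is with `replicate()` in base R (a sketch, not the `artemis` internals; the distribution parameters here are invented):

```r
set.seed(42)
n_obs = 180   # observations per simulated dataset
n_sim = 500   # number of simulated datasets

# Hypothetical: each column holds one simulated vector of ln[eDNA] values
sims = replicate(n_sim, rnorm(n_obs, mean = -10.6, sd = 1))
dim(sims)   # 180 rows (observations) x 500 columns (simulations)
```

Column-wise summaries (e.g. `apply(sims, 2, mean)`) then give one statistic per simulated dataset, which is the basic ingredient of a power analysis.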

The results of the simulations can be plotted with `plot()`:

```
plot(ans, alpha = 0.5, jitter_width = 0.2)
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]
```

Similar to `summary()`, each plot panel shows the marginal distribution of the response (either Cq or \(log[eDNA]\)) across the levels of the variables used to simulate the data.

`plot()` is a generic method intended for quick visualization of simulations. When we wish to create custom plots, we can convert a simulated dataset to a `data.frame` by calling `as.data.frame()` or `data.frame()` on the simulation object, and then plot it with our method of choice. Here is an example using `ggplot2`:

```
library(ggplot2)

simsdf <- data.frame(ans)
# custom plot of simulated data
ggplot(simsdf, aes(x = factor(distance), y = Cq_star)) +
  geom_jitter(aes(color = factor(volume)),
              width = 0.05,
              alpha = 0.65) +
  geom_boxplot(alpha = 0.5,
               width = 0.2,
               size = 0.5) +
  theme_minimal() +
  labs(x = "Distance (m)", y = "Estimated Cq")
```

To simulate from a mixed-effects model, we use the function `sim_eDNA_lmer()`. Random effects are specified using the syntax of the `lme4` package, e.g. `(1|random effect)`:

```
ans3 = sim_eDNA_lmer(Cq ~ distance + volume + (1|rep) + (1|tech_rep),
                     variable_list = vars,
                     betas = c(intercept = -10.6, distance = -0.05, volume = 0.01),
                     sigma_ln_eDNA = 1,
                     sigma_rand = c(0.1, 0.1), # stdev of the random effects
                     std_curve_alpha = 21.2,
                     std_curve_beta = -1.5)
#> Chain 1:
#> Chain 1: Elapsed Time: 0 seconds (Warm-up)
#> Chain 1: 6.1e-05 seconds (Sampling)
#> Chain 1: 6.1e-05 seconds (Total)
#> Chain 1:
```

When we are simulating data with random effects, we must provide the standard deviations of the random effects (`sigma_rand`, a vector containing one value for each random effect), in addition to the information for the fixed effects. For these simulations, the random effects are assumed to be drawn from a normal distribution with the corresponding `sigma_rand` standard deviation. When `n_sim` > 1, different random effects are generated for each simulation.
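Conceptually, this amounts to drawing one intercept per level of each grouping variable from a zero-mean normal distribution. A base-R sketch under that assumption (level counts taken from the `vars` list above):

```r
set.seed(1234)
sigma_rand = c(rep = 0.1, tech_rep = 0.1)  # stdev of each random effect

# One draw of random intercepts per level of each grouping variable
re_rep      = rnorm(3,  mean = 0, sd = sigma_rand["rep"])       # rep: 3 levels
re_tech_rep = rnorm(10, mean = 0, sd = sigma_rand["tech_rep"])  # tech_rep: 10

round(re_rep, 3)
```

Each simulated observation would then receive the intercepts for its `rep` and `tech_rep` levels on top of the fixed-effects linear predictor.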

By default, the simulation functions construct a model matrix assuming a complete block design (i.e. every level of each variable is represented). If we want to simulate data for an incomplete block design, we can specify our own variable `data.frame`, `X`, to create the model matrix.
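A sketch of building such an incomplete design by subsetting a complete grid (the subsetting rule here is invented for illustration):

```r
# Complete grid of the survey variables, then drop combinations that were
# not actually sampled: suppose volume = 50 was only collected at distance = 0.
full = expand.grid(distance = c(0, 15, 50),
                   volume   = c(25, 50),
                   tech_rep = 1:10,
                   rep      = 1:3)
X = full[!(full$volume == 50 & full$distance > 0), ]

nrow(full)   # 180 rows in the complete block design
nrow(X)      # 120 rows remain in the incomplete design
```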

When we specified the `vars` list above, we included variables that we did not use in the formula. By design, the simulation and modeling functions ignore extra variables in the `variable_list` or `data` inputs. This is a workflow perk: it allows us to create a comprehensive list of all the variables we might be interested in, and then we need only adjust the model formula to change the simulations.

Sometimes we get simulation results that seem very surprising, and often our first instinct is to suspect the model. When this happens, it helps to remember that the model underlying the simulation functions is deterministic in its expected value: barring any bugs in the code, there is no stochasticity in the calculations that `artemis` performs, and they are mathematically sound when the inputs are right. There is a random component in simulating the measurement error, but even that should play out predictably. Thus, if we're getting surprising or nonsensical simulations, the trouble probably lies with the assumptions or the data we provided to the model, and the input data is almost always the best place to start looking for the source of the unexpected behavior.