class: title-slide, left, bottom

<font size="50"> Monte Carlo Simulation: Some Fundamentals </font>
<br><br><br><br><br><br>
**Ashley I Naimi, PhD** <br>
Associate Professor <br>
Emory University <br>
<svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"><path d="M464 64H48C21.49 64 0 85.49 0 112v288c0 26.51 21.49 48 48 48h416c26.51 0 48-21.49 48-48V112c0-26.51-21.49-48-48-48zm0 48v40.805c-22.422 18.259-58.168 46.651-134.587 106.49-16.841 13.247-50.201 45.072-73.413 44.701-23.208.375-56.579-31.459-73.413-44.701C106.18 199.465 70.425 171.067 48 152.805V112h416zM48 400V214.398c22.914 18.251 55.409 43.862 104.938 82.646 21.857 17.205 60.134 55.186 103.062 54.955 42.717.231 80.509-37.199 103.053-54.947 49.528-38.783 82.032-64.401 104.947-82.653V400H48z"></path></svg> ashley.naimi@emory.edu <br>
<svg viewBox="0 0 512 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> @ashley_naimi <br>
<svg viewBox="0 0 496 512" xmlns="http://www.w3.org/2000/svg" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> ainaimi <br>

---
# Simulations

.font150[
* General Principles
* Types of Simulation
* Let's Simulate
* Analyzing and Reporting
* Considerations
* Further Exploration
]
---
# General Principles behind Data Simulation

.font150[
* Why Simulate?
  * Exploring
  * Learning
  * Teaching
  * Research
]

---
# General Principles behind Simulation

.font150[
* Every sample has an underlying cohort
* Clean code
* Efficient code
* Reproducible code
]

---
# Types of Simulation

.font150[
* <span style="color: red"> Systems Dynamics and Agent Based Models </span>
* <span style="color: red"> The Parametric G Formula (aka G Computation) </span>
* Monte Carlo Simulation
  * Regular
  * Plasmode
]

---
# Let's Simulate Data

.pull-left-a-little[

<img src="./images/F1.png" width="400px" style="display: block; margin: auto;" />

]

--

.pull-right-a-lot[


```r
set.seed(123)

expit <- function(x){1/(1+exp(-x))}

n <- 500
id <- 1:n

# two binary confounders
c1 <- rbinom(n, 1, .5)
c2 <- rbinom(n, 1, .5)

# binary exposure, with probability depending on c1 and c2
x <- rbinom(n, 1, expit(-2 + log(2)*c1 + log(2)*c2))

# continuous outcome; the true effect of x is 3
y <- 30 + 3*x + 3*c1 + 3*c2 + rnorm(n, 0, 1)

sim_dat <- data.frame(id, y, x, c1, c2)

head(sim_dat)
```

```
##   id    y x c1 c2
## 1  1 31.5 0  0  0
## 2  2 32.9 0  1  0
## 3  3 30.5 0  0  0
## 4  4 36.2 1  1  0
## 5  5 35.8 1  1  0
## 6  6 29.9 0  0  0
```

]

---
# Let's Simulate Data


```r
# coeftest() is from the lmtest package; vcovCL() is from sandwich
library(lmtest)
library(sandwich)

# outcome regression, adjusting for c1 and c2
mod1 <- glm(y ~ x + c1 + c2, data = sim_dat, family = gaussian("identity"))
coeftest(mod1, vcov = vcov(mod1))[2,]
```

```
##    Estimate  Std. Error     z value    Pr(>|z|) 
##    3.04e+00    1.17e-01    2.60e+01   2.39e-149
```

```r
# propensity score model and stabilized inverse probability weights
ps <- glm(x ~ c1 + c2, data = sim_dat, family = binomial("logit"))$fitted.values

sim_dat$sw <- x*(mean(x)/ps) + (1 - x)*((1 - mean(x))/(1 - ps))
summary(sim_dat$sw)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.602   0.887   0.985   1.000   1.006   2.069
```

```r
# IP weighted outcome model with a robust (sandwich) variance
mod2 <- glm(y ~ x, data = sim_dat, family = gaussian("identity"), weights = sw)
coeftest(mod2, vcov = vcovCL(mod2, cluster = sim_dat$id, type = "HC1"))[2,]
```

```
##   Estimate Std. Error    z value   Pr(>|z|) 
##   3.06e+00   2.91e-01   1.05e+01   8.90e-26
```

--

.font150[
<span style="color: red"> How different are the estimates of `\(\beta_X\)` when using outcome regression versus IP weighting? </span>
]
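
--

A quick way to eyeball the answer in this single sample is shown below (a minimal sketch; it assumes `mod1` and `mod2` from the chunks above are still in memory):


```r
# difference between the two point estimates of beta_X in this one sample
coef(mod1)["x"] - coef(mod2)["x"]
```

A single sample can only hint at the answer; the simulation study on the next slides repeats the comparison many times.
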
---
# Simulation Study


```r
set.seed(123)

expit <- function(x){1/(1+exp(-x))}

# z indexes the Monte Carlo replicate (not used inside the function)
# coeftest() and vcovCL() require lmtest and sandwich (loaded earlier)
simulation_function <- function(z){
  
  n <- 500
  id <- 1:n
  
  c1 <- rbinom(n, 1, .5)
  c2 <- rbinom(n, 1, .5)
  
  x <- rbinom(n, 1, expit(-2 + log(2)*c1 + log(2)*c2))
  
  y <- 30 + 3*x + 3*c1 + 3*c2 + rnorm(n, 0, 1)
  
  sim_dat <- data.frame(id, y, x, c1, c2)
  
  # outcome regression
  mod1 <- glm(y ~ x + c1 + c2, data = sim_dat, family = gaussian("identity"))
  coef1 <- coeftest(mod1, vcov = vcov(mod1))[2, 1:2]
  
  # IP weighting with stabilized weights
  ps <- glm(x ~ c1 + c2, data = sim_dat, family = binomial("logit"))$fitted.values
  sim_dat$sw <- x*(mean(x)/ps) + (1 - x)*((1 - mean(x))/(1 - ps))
  
  mod2 <- glm(y ~ x, data = sim_dat, family = gaussian("identity"), weights = sw)
  coef2 <- coeftest(mod2, vcov = vcovCL(mod2, cluster = sim_dat$id, type = "HC1"))[2, 1:2]
  
  res <- rbind(cbind(1, t(coef1)),
               cbind(2, t(coef2)))
  
  return(res)
}
```

---
# Simulation Study


```r
set.seed(123)

res <- lapply(1:10000, function(ii) simulation_function(ii))

res <- do.call(rbind, res)
res <- data.frame(res)

names(res) <- c("Model", "Estimate", "SE")

head(res)
```

```
##   Model Estimate     SE
## 1     1     3.04 0.1169
## 2     2     3.06 0.2911
## 3     1     2.86 0.0997
## 4     2     2.82 0.2727
## 5     1     3.16 0.1103
## 6     2     3.21 0.2598
```

```r
# group_by(), summarize(), and %>% come from the tidyverse (dplyr/magrittr)
library(tidyverse)

res %>% 
  group_by(Model) %>% 
  summarize(bias = mean(3 - Estimate),
            sd_beta = sd(Estimate),
            se_beta = mean(SE))
```

```
## # A tibble: 2 x 4
##   Model      bias sd_beta se_beta
## * <dbl>     <dbl>   <dbl>   <dbl>
## 1     1  0.000248   0.109   0.110
## 2     2 -0.000681   0.136   0.277
```

---
# Simulation Study

<img src="images/sim_res.png" width="1000px" style="display: block; margin: auto;" />

---
# Setting up and Analyzing Simulation Data

.font150[
* Monte Carlo simulations are experiments
* For a Monte Carlo sample size `\(M\)` and an acceptable Monte Carlo error `\(e\)` (see the worked sketch at the end of these slides):
  * for error in bias, `\(M = \frac{SD\{\hat{\psi}\}^2}{e^2}\)`, where `\(SD\{\hat{\psi}\}\)` is the standard deviation of the estimator across simulated samples
  * for error in proportions (e.g., coverage), `\(M = \frac{4p(1-p)}{e^2}\)`
]

---
# Simulation Considerations

.font150[
* Output and save more than you might need
* Keep MC samples small at first (e.g., while developing and debugging code)
* Be aware of how you parameterized the simulations
  * Leaving something out (e.g., an interaction term) is assuming it is zero!
]

---
# For Further Exploration

.font150[
* R functions
  * lapply, mapply, and mclapply
* parallel processing (see the mclapply sketch at the end of these slides)
]

---
# Some Papers for Further Reading

.font150[
* Morris et al (2019) Using simulation studies to evaluate statistical methods. Stat Med. 38(11): 2074-2102.
* Rudolph et al (2020) Simulation as a Tool for Teaching and Learning Epidemiologic Methods. Am J Epidemiol. https://doi.org/10.1093/aje/kwaa232
* Rudolph et al (2021) Simulation in Practice: The Balancing Intercept. Am J Epidemiol. https://doi.org/10.1093/aje/kwab039
]
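
---
# Worked Sketch: Choosing the Monte Carlo Sample Size

A minimal back-of-the-envelope sketch of the two formulas from the "Setting up and Analyzing Simulation Data" slide. The inputs (the assumed SD of the estimator and the acceptable errors) are illustrative, not prescriptive; `sd_psi` is taken to be roughly the empirical SD of the estimator seen in the simulation results above.


```r
sd_psi <- 0.11   # assumed SD of the estimator across simulated samples
e_bias <- 0.005  # acceptable Monte Carlo error for the bias

# M needed to estimate bias with error e_bias
M_bias <- (sd_psi / e_bias)^2
M_bias  # 484

p      <- 0.95   # e.g., targeted coverage probability
e_prop <- 0.005  # acceptable Monte Carlo error for a proportion

# M needed to estimate a proportion (e.g., coverage) with error e_prop
M_prop <- 4 * p * (1 - p) / e_prop^2
M_prop  # 7600
```
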
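
---
# Sketch: Parallelizing the Simulation

A minimal sketch of how the `lapply()` call above might be swapped for `parallel::mclapply()`. This assumes a unix-alike machine (on Windows, `mc.cores` must be 1), a core count of 4 chosen only for illustration, and that `simulation_function()` from the earlier slide is already defined.


```r
library(parallel)

# for reproducible random number streams across forked workers
RNGkind("L'Ecuyer-CMRG")
set.seed(123)

# drop-in replacement for lapply(), run across mc.cores forked processes
res_par <- mclapply(1:10000,
                    function(ii) simulation_function(ii),
                    mc.cores = 4)

res_par <- data.frame(do.call(rbind, res_par))
names(res_par) <- c("Model", "Estimate", "SE")

head(res_par)
```
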