class: title-slide, left, bottom <span style="font-size: 50px;"> Applied Statistics Requires Scientific Context </span> <br> <span style="font-size: 35px;"> The Role of Epidemiologists in Statistics Reform </span> <br> <br><br><br><br><br><br><br><br> <br><br> <table style="border: none; border-collapse: collapse; margin-left: 0; margin-top: -50px; float: left;"> <tr style="border: none; background: transparent; line-height: 0.8;"> <td style="border: none; border-right: 1px solid #6ca3d9; padding: 2px 5px; background: transparent;"><strong>Ashley I Naimi, PhD</strong></td> <td style="border: none; padding: 2px 5px; background: transparent;">Dept of Epidemiology</td> </tr> <tr style="border: none; background: transparent; line-height: 0.8;"> <td style="border: none; border-right: 1px solid #6ca3d9; padding: 2px 5px; background: transparent;">Professor</td> <td style="border: none; padding: 2px 5px; background: transparent;">Emory University</td> </tr> </table> <div style="line-height: 1.2;">
<a href="mailto:ashley.naimi@emory.edu">ashley.naimi@emory.edu</a> <br>
<a href="https://ainaimi.github.io/">https://ainaimi.github.io/</a> </div> <img src="images/qr_code.svg" class="qr-code" alt="QR Code"> --- # Acknowledgements .font150[ - This work benefitted from discussion with several colleagues and friends: - Enrique Schisterman, Sunni Mumford, Stephanie Hinkle, Alexander Levin, and Eric Tchetgen Tchetgen at UPenn - Larry Wasserman and Aaditya Ramdas at CMU - Stephen Cole at UNC - Sander Greenland at UCLA - Please feel free to report errors and send comments to me: [ashley.naimi@emory.edu](mailto:ashley.naimi@emory.edu) ] --- # My Goal with This Talk .middle-content[ .font150[ Convince you that we (epidemiologists/biostatisticians) have an important role to play in statistics (science) reform ] ] --- # Overview .font120[ * Background: The Mis-Use of Statistics * Statistics and "Context" * P Values — Divergence and Decision * Significance Thresholds and Context Considerations in Practice - Low Dose Aspirin and Pregnancy Loss - Tofacitinib and Ankylosing Spondylitis * Threats to Validity Arise from Context * Summary: Meno's Paradox of Inquiry * The Role of Epidemiologists <!-- - Applied statsitics needs epidemiologic thinking. start with the main point that Tukey makes in 1962: that "data science" should not be mathematical A goal of mathematical statisticians: develop and evaluate data analysis methods that are generally applicable, and thus decontextualized. This is important, and ... --> ] --- <br><br><br><br> .font150[ > ... there is no royal road to statistical induction. .right[— Jacob Cohen (1990) *Am Psychol* 45(12):1304-1312] ] --- # Background: The Mis-Use of Statistics Statistical methods play a central role in science. But, ... 
-- <img src="images/s1.png" class="random-img-1" style="width: 500px;"> <img src="images/s2.png" class="random-img-2" style="width: 500px;"> <img src="images/s3.png" class="random-img-3" style="width: 500px;"> <img src="images/s4.png" class="random-img-4" style="width: 500px;"> <img src="images/s5.png" class="random-img-5" style="width: 500px;"> <img src="images/s6.png" class="random-img-6" style="width: 500px;"> <img src="images/s7.png" class="random-img-7" style="width: 500px;"> <img src="images/s8.png" class="random-img-8" style="width: 500px;"> <img src="images/s9.png" class="random-img-9" style="width: 500px;"> <img src="images/s10.png" class="random-img-10" style="width: 500px;"> <img src="images/s11.png" class="random-img-11" style="width: 500px;"> <img src="images/s12.png" class="random-img-12" style="width: 500px;"> <img src="images/s13.png" class="random-img-13" style="width: 500px;"> <img src="images/s14.png" class="random-img-14" style="width: 500px;"> ??? Statistical methods play a central role in science. However, there has long been a vigorous discussion and debate on the relationships between statistical methods and the generation and validation of scientific knowledge. 
This discussion has focused on a number of topics, including: - how p values and other inferential measures are used and taught (Wasserstein and Lazar 2016, Rafi and Greenland 2020, Alawbathani et al 2021) - the interpretation of p values and related metrics, such as s values and compatibility measures (Cole et al 2021, Greenland 2019a, Greenland 2022a, Greenland 2023, Greenland 2023a) - their relation to standards of evidence (Peskun2020, vanZwet2023, Gibson2021) - the selection and utility of p value thresholds (McShane2019, Benjamin2018, Hemerik2025, Maier2022) - the impact of heuristic tools and cognitive biases on the interpretation of statistical/scientific results (Rafi2020a) - the role of alternative frameworks, such as Bayesian decision making and inferential systems or, more recently, an alternative to the p value referred to as the "e value" (Goodman1995a, Bickel2022, Ramdas2025) --- # Background: Applied Statistics and Context .font130[ - Major subtext of this literature: "scientific context" - However, "scientific context" is an imprecise, undefined term. - This has led to important ambiguities, obfuscating precisely *how* one might act on various calls for statistics reform. ] ??? It's easy to read this literature and walk away thinking that the problems discussed are mathematical; that if we make some adjustment or augmentation to the way we calculate statistical measures, these problems will go away. A closer read of this literature reveals a subtext, or major underlying theme: the role that scientific context must play when using statistical methods. However, scientific context is an imprecise term with no generally accepted sharp definition. This has led to an important ambiguity in how this term is used in the literature, obfuscating precisely how one might act on various calls for statistics reform.
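As a quick aside on the s values mentioned earlier: an s value is simply a log transform of a p value, `\(s = -\log_2(p)\)`, re-expressing it as bits of information against the test model. A minimal Python sketch (not part of the original slides):

```python
import math

def s_value(p: float) -> float:
    """Shannon surprisal: bits of information supplied by p against the test model."""
    return -math.log2(p)

# p = 0.05 corresponds to about 4.3 bits -- roughly as surprising as
# seeing 4 heads in a row from a coin we assumed was fair.
print(round(s_value(0.05), 2))
print(round(s_value(0.25), 2))
```

With p = 0.25 the surprisal is exactly 2 bits (two heads in a row), which is one way to convey how weak a "nonsignificant" p can be as evidence.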
PAUSE -- .font130[ - scientific/clinical context is often used in one of two distinct ways: - Foundational: nebulous and elusive, since this context will depend on the nature of the scientific question and the nuances of the area being studied. This kind of context is often presented as "subjective." - Quantitative: more superficial, easier to understand, often implicates the outputs of statistical methods applied to data. ] ??? In relation to the use and application of statistical methods, scientific or clinical context is often used in one of two distinct ways. The first is foundational, but nebulous and elusive, since this context will depend in unpredictable ways on the nature of the scientific question and the nuances of the area being studied, and will often be presented as subjective. The second is more superficial and easier to understand, and is usually quantitative in nature. It's often based on the outputs of statistical methods applied to data.
Recognition of the need to incorporate foundational context into applied statistics is not new. There is, in fact, an extensive literature dating back to at least the early 20th century. Several examples can be listed. Jerzy Neyman himself, in discussing the use of his significance testing framework, noted that error thresholds should be selected "[a]ccording to the circumstances and ... subjective attitudes of the research worker." John Tukey, in what might be considered the founding manifesto of modern data science, famously argued that the results of a data analysis depend on judgements that arise in an attempt to mitigate the complexity of a particular experimental scenario. --- # Background: Applied Statistics and Context .font150[ - **David Freedman**: The role that "shoe leather"—careful detective work conducted by painstakingly walking the streets of London to collect facts and validate assumptions—played in John Snow's work <a name=cite-Freedman1991></a>([Freedman, 1991](#bib-Freedman1991), p299) - **Philip Stark**: "[w]hat claims and appears to be 'science' may be a mechanical amplification of the opinions and *ad hoc* choices" that arise as the result of and are shaped by the specific contextual details of a scientific question. <a name=cite-Stark2022a></a>([Stark, 2022a](#bib-Stark2022a)) ] <div class="footnote"> <p><cite><a id='bib-Freedman1991'></a><a href=>Freedman, D. A.</a> (1991). In: <em>Sociological Methodology</em> 21, pp. 291–313.</cite></p> <p><cite><a id='bib-Stark2022a'></a><a href=>Stark, P. B.</a> (2022a). In: <em>Pure and Applied Geophysics</em> 179.11, pp. 4121–4145.</cite></p> </div> ??? David Freedman documented the role that "shoe leather" played in leading John Snow to end the cholera outbreak he was studying in the 1850s. By "shoe leather", he meant the careful detective work conducted by painstakingly walking the streets of London to collect facts and validate assumptions. 
Freedman argued that Snow's "brilliant detective work on nonexperimental data" was impressive not because of the statistical techniques used, but because of "the handling of the scientific issues." Finally, in discussing the importance of models, Philip Stark notes that "[w]hat claims and appears to be 'science' may be a mechanical amplification of the opinions and ad hoc choices ... which lacks any tested, empirical basis." These "opinions and ad hoc choices" arise as the result of, and are shaped by, the specific contextual details of a scientific question. Several other such illustrations can be found throughout the scientific literature. --- # Background: Applied Statistics and Context .font150[ - **Rebecca Betensky**: The interpretation of p-values "must be considered in the context of certain features of the design and substantive application, such as sample size and meaningful effect size." <a name=cite-Betensky2019></a>([Betensky, 2019](#bib-Betensky2019), p115) - **Roychoudhury et al**: Propose a dual-criterion phase II clinical trial design that combines statistical significance with "clinical relevance", defining the latter as a "sufficiently large effect estimate." <a name=cite-Roychoudhury2018></a>([Roychoudhury et al., 2018](#bib-Roychoudhury2018), p452) - **Minimal Clinically Important Differences**: A good entry article is <a name=cite-McGlothlin2014></a>([McGlothlin et al., 2014](#bib-McGlothlin2014)) ] <div class="footnote"> <p><cite><a id='bib-Betensky2019'></a><a href=>Betensky, R. A.</a> (2019). In: <em>The Am Stat</em> 73.1, pp. 115–117.</cite></p> <p><cite><a id='bib-Roychoudhury2018'></a><a href=>Roychoudhury, S. et al.</a> (2018). In: <em>Clinical Trials</em> 15.5, pp. 452–461.</cite></p> <p><cite><a id='bib-McGlothlin2014'></a><a href=>McGlothlin, A. E. et al.</a> (2014). In: <em>JAMA</em> 312.13, pp. 1342–1343.</cite></p> </div> ??? In contrast, there are many who define scientific or clinical context differently.
For instance, in an article entitled "The P Value Requires Context, Not a Threshold," Rebecca Betensky argued that the interpretation of p-values "must be considered in the context of certain features of the design and substantive application, such as sample size and meaningful effect size." Roychoudhury et al proposed a dual-criterion phase II clinical trial design that combines statistical significance with "clinical relevance", defining the latter as no more than a "sufficiently large effect estimate." Of course, both of these are connected to a much broader literature on "minimal important differences," which reflect the smallest magnitude of change in a study's outcome that would be considered "important" based on clinical or scientific context. --- # Background: Applied Statistics and Scientific Context .font150[ - These two notions of scientific context are connected. - Features often described as "context" in the sharper sense will arise out of the more nebulously defined "context." - Foundational context is, arguably, more important ] ??? Importantly, these two notions of scientific context--foundational and quantitative--are connected. Features of study design and effect magnitudes described as "context" in the sharper sense will often arise out of aspects of the more nebulously defined notion of context. Foundational context is, therefore, arguably more important, since quantitative context depends on it. Note, however, that I am not advocating for an abandonment of sharper contextual considerations. I believe these are an essential part of good science. What I am arguing is that we must understand where these sharper contextual considerations come from.
Now, rather than repeating that scientific context is important, but nebulously defined and elusive, I want to try to provide some more concrete insights. To do that, I will start by presenting a commonly used statistical tool, the p value. Furthermore, let's start by defining the p value in a format that is considered technically correct, but that has important limitations. Up until very recently, this definition of the p value is what I would teach in my courses. In this definition, we say that "the p value is the probability of observing a result as extreme or more extreme than the one we observed if the test hypothesis was true." I often make it a point to add and emphasize the additional consideration that the "underlying statistical model is correct," and that this so-called "underlying statistical model" is not referring to just the, e.g., logistic regression model we are using to conduct our analysis. Inevitably, students ask me what else the "Model" might be referring to. When pressed, statisticians will often refer to a varying collection of items that can include, possibly, the assumption of independent and identically distributed observations, no measurement error, no confounding, as well as many other items. Though technically correct, this definition of the p value is insufficient in part because it is all too easy to either ignore the condition that "the underlying statistical model is correct," or to simply mention it without actually understanding what this condition is. --- # The P Value: Divergence Interpretation .font120[ - The `\(p\)`-value: a quantile (or percentile) location measure of divergence between the actual data `\(z\)`, and what we'd expect the data to look like <span style="color: red;">under certain conditions/assumptions</span> <a name=cite-Greenland2023a></a><a name=cite-Perezgonzalez2015></a>([Greenland, 2023b](#bib-Greenland2023a); [Perezgonzalez, 2015](#bib-Perezgonzalez2015)).
- Let's call these <span style="color: red;">conditions/assumptions</span> `\(M\)` - Suppose interest lies in the ITT effect in a double-blind placebo controlled trial: `$$\psi_{ITT} = E \left (Y^{a = 1} \right ) - E \left (Y^{a = 0} \right )$$` - `\(Y^a\)`: outcome that would be observed if an individual was randomized to treatment arm `\(A = a\)` ] <div class="footnote"> <p><cite><a id='bib-Greenland2023a'></a><a href=>Greenland, S.</a> (2023b). In: <em>Scandinavian Journal of Statistics</em> 50.1, pp. 54–88.</cite></p> <p><cite><a id='bib-Perezgonzalez2015'></a><a href=>Perezgonzalez, J. D.</a> (2015). In: <em>Frontiers in Psychology</em> 6.</cite></p> </div> ??? So this brings us to a re-formulation, or re-definition, of the P value introduced by Sander Greenland, building on some insights made in psychology. Following this work, we define the P value as a percentile location measure of divergence, distance, or separation between the data we collected, let's call this `\(z\)`, and what we'd expect these data to look like if a set of conditions or assumptions were true. For convenience, let's call the set of conditions and assumptions `\(M\)`. By way of example or illustration, let's say we're interested in estimating the intention to treat effect in a double blind placebo controlled trial, which we can define using potential outcomes. *** More technically, `\(M\)` is a model manifold that lies in the `\(n\)`-dimensional expectation space defined by the data generating mechanism that produced `\(z\)`. Importantly, this manifold `\(M\)` is a subset of the `\(Z\)`-space into which the conjunction of the model constraints (assumptions) and test hypothesis force the data expectation, or predict where `\(z\)` would be were there no random variability. This manifold is the product of the *logical conjunction* of the data and elements in `\(M\)`.
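To make the divergence reading concrete, here is a toy numerical sketch (hypothetical simulated data, not the slides' code): compute the standardized divergence `\(d\)` between the arm-specific sample means in `\(z\)` and the zero difference implied by the full set of conditions `\(M\)`, then convert `\(d\)` to a two-sided tail area.

```python
import math
import random

def divergence_p(y1, y0):
    """Two-sided p value (normal approximation) for the difference in sample
    means: the tail area of the standardized divergence d between z and the
    zero difference implied by the full set of conditions M."""
    n1, n0 = len(y1), len(y0)
    m1, m0 = sum(y1) / n1, sum(y0) / n0
    v1 = sum((y - m1) ** 2 for y in y1) / (n1 - 1)
    v0 = sum((y - m0) ** 2 for y in y0) / (n0 - 1)
    d = (m1 - m0) / math.sqrt(v1 / n1 + v0 / n0)   # standardized divergence
    # Normal CDF via erf; p is the probability of a divergence this large
    # or larger when every element of M (the null included) holds.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(d) / math.sqrt(2))))

# Hypothetical two-arm trial with no true effect: z should be highly
# compatible with M, so p is large on average across repetitions.
random.seed(1)
arm1 = [random.gauss(0, 1) for _ in range(500)]
arm0 = [random.gauss(0, 1) for _ in range(500)]
print(divergence_p(arm1, arm0))
```

When the two arms have identical sample means, `\(d = 0\)` and `\(p = 1\)`: the data sit exactly where `\(M\)` says they should.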
I note here (and discuss later) that these assumptions and conditions depend heavily on the target parameter of interest and the study context and design. They are often broad, sometimes implicit, and mostly hard to verify or guarantee in practice. --- # The P Value: Divergence Interpretation .font150[ `$$\psi = E \left (Y^{a = 1} \right ) - E \left (Y^{a = 0} \right )$$` - This ITT effect can be estimated with our data under conditions/assumptions that constitute elements of `\(M\)`, such as: - (`\(M_1\)`) randomization "worked" - (`\(M_2\)`) blinding was maintained - (`\(M_3\)`) any loss to follow-up is MCAR ] ??? This ITT effect can be estimated with our data (or, is identified) under conditions or assumptions that constitute elements of M. For example, a non-exhaustive list of assumptions might include things like randomization worked (i.e., that the distribution of all covariates across the treatment group are "balanced"), that blinding was maintained or that any unblinding is inconsequential, that any loss to follow-up in the trial can be classified as, for example, missing completely at random. --- # The P Value: Divergence Interpretation .font150[ - Let's construct a `\(p\)`-value for this trial - We first add to our assumptions `\(M\)` our test hypothesis (the null), so that: - `\(M = \{ M_0: \psi = 0, M_1, M_2, M_3\}\)` - We then take the sample mean of the outcome in each treatment arm: `\(z = (\bar Y_1,\bar Y_0)\)` - This new `\(M\)` implies: `\(\hat{\psi} = \bar Y_1 - \bar Y_0\)` should be zero ] ??? To obtain a p value in this setting, we add to our set of assumptions the chosen test hypothesis value, which in our case will be the null. This null value, when combined with the other conditions/assumptions in M, implies that the difference in estimated treatment specific means should be zero. Importantly, it's not just null hypothesis in isolation that implies that our sample means should be equivalent. 
It is the entire set of assumptions in M that implies the difference in sample means should be zero. --- # The P Value: Divergence Interpretation <img src="images/p_value_combined.png" alt="" width="100%" style="display: block; margin: auto;" /> ??? FOR THE LEFT: - POINT OUT THE X AND Y AXIS - POINT OUT THE DIAGONAL - EXPLAIN THAT Z REPRESENTS THE SAMPLE ESTIMATE - EXPLAIN THAT D REPRESENTS HOW "FAR" THE SAMPLE ESTIMATE IS FROM THE DIAGONAL FOR THE RIGHT: - WE CAN CONVERT D INTO A DISTRIBUTION THAT TELLS US THE EXPECTED FREQUENCIES OF DIFFERENT D'S UNDER THE SET OF CONDITIONS M - THE AREA UNDER THIS DISTRIBUTION CURVE CAN THEN BE USED TO MEASURE HOW "COMPATIBLE" OUR DATA ARE WITH ALL THE CONDITIONS IN M --- # The P Value: Divergence Interpretation <img src="images/pval_geometry.gif" alt="" width="100%" style="display: block; margin: auto;" /> ??? This figure shows that when p is 1, the data we observed are perfectly compatible with the assumptions in M, including our test hypothesis. --- # The P Value: Divergence Interpretation .font150[ - `\(p\)` measures the "consonance/divergence" between `\(z\)` and `\(M\)`. - Larger "divergences" yield smaller `\(p\)` → incompatibility between the <span style="color: red;">set of conditions (all of them) in `\(M\)`</span> and `\(z\)`. ] ??? What's useful about this definition of the p value is that it forces us to interpret the result we get as a measure of compatibility between the data AND the set of assumptions that must arise out of the scientific context of the study we are doing. This is in contrast to the alternative definition which, though technically correct, facilitates relegating these critical assumptions to the background. --- # The P Value: Decision Interpretation .font130[ - `\(p\)`-value: output of a decision criterion about whether or not to reject the model `\(M\)` as a plausible model for the data generating mechanism.
- Conventional approach: `\(\alpha\)` (Type I error), often 0.05; `\(\beta\)` (Type II error), often 0.2. - Two Problems: - 1) Error rates were intended to be chosen "[a]ccording to the circumstances and ... subjective attitudes of the research worker" ([Neyman, 1977](#bib-Neyman1977), p104). But standard thresholds are used even when these numerical values conflict with scientific aspects of the question under study. - 2) When `\(p < \alpha\)`, researchers are often compelled to reject the test hypothesis in isolation. But interpretation is only valid if one accepts as true all other elements of the model `\(M\)`. ] <div class="footnote"> <p><cite><a id='bib-Neyman1977'></a><a href=>Neyman, J.</a> (1977). In: <em>Synthese</em> 36.1, pp. 97–131.</cite></p> </div> ??? A second framework we can use to interpret the p value is that it is the output of a decision criterion about whether or not to reject the set `\(M\)` as a plausible model that generated the data. If we follow convention, we'd elect to reject `\(M\)` if the divergence p value was less than our alpha threshold of 0.05. But there are two problems with this approach: First, error rates were intended to be selected [a]ccording to the circumstances and ... subjective attitudes of the research worker. Unfortunately, standard thresholds are used even when these numerical values conflict with the scientific aspects of the study we want to conduct. Second, researchers are often compelled to reject the test hypothesis in isolation. But this interpretation is only valid if one accepts as true all of the other elements of the set of conditions/assumptions we denoted `\(M\)`. In the remainder of this talk, we'll go over some examples where these issues matter practically. --- # Context Considerations in Practice: The EAGeR Trial .font150[ - Low Dose Aspirin and Pregnancy Loss: EAGeR <a name=cite-Schisterman2014></a>([Schisterman et al., 2014](#bib-Schisterman2014)). 
- Motivation: unexplained recurrent miscarriage may be attributable to underlying inflammation <a name=cite-Silver2007></a>([Silver et al., 2007](#bib-Silver2007)). - Aspirin: anti-inflammatory, in use for over a century, affordable, well-known low risk profile. - LDA in use for nearly a decade to treat unexplained recurrent miscarriage, even though evidence of this effect was lacking. ] <div class="footnote"> <p><cite><a id='bib-Schisterman2014'></a><a href=>Schisterman, E. F. et al.</a> (2014). In: <em>Lancet</em> 384.9937, pp. 29–36.</cite></p> <p><cite><a id='bib-Silver2007'></a><a href=>Silver, R. M. et al.</a> (2007). In: <em>Clinical Obstetrics</em>. John Wiley & Sons, Ltd. Chap. 11, pp. 141–160.</cite></p> </div> ??? Let's start our practical discussion with the Effects of Aspirin on Gestation and Reproduction (EAGeR) Trial, a multicenter double-blind placebo controlled trial of the effect of daily low-dose (81 mg) aspirin on live birth outcomes among women who were trying to conceive, but who had experienced one or two prior pregnancy losses. The trial was motivated by the fact that unexplained recurrent miscarriage may be attributable to underlying inflammation. Aspirin is an anti-inflammatory drug that has been in use for over a century in the general population; it is affordable, and its side effects are well understood and relatively low-risk. And critically, at the time when the trial was conducted, low dose aspirin had been used for nearly a decade in clinical settings to treat unexplained recurrent miscarriage, even though evidence of its effect was lacking. The EAGeR Trial was conducted to fill this evidence gap. --- # Context Considerations in Practice: The EAGeR Trial .font150[ - Aspirin use as SOP establishes a high tolerance for type I error. - Power calculations for the EAGeR trial were conducted for a 10% absolute difference in live birth (two-sided `\(\alpha = 0.05; \beta = 0.20\)`) ([Schisterman et al., 2014](#bib-Schisterman2014)).
- However, powering the trial using a higher `\(\alpha\)` threshold could have accomplished the same objectives. - Per NIH Reporter, a total of roughly $10M was spent on EAGeR. Per participant, this is about $8.1K. Reducing sample size by 200 would have resulted in an estimated savings of roughly $1.6M. ] <div class="footnote"> <p><cite><a id='bib-Schisterman2014'></a><a href=>Schisterman, E. F. et al.</a> (2014). In: <em>Lancet</em> 384.9937, pp. 29–36.</cite></p> </div> ??? This scientific context matters for study design and analysis. Indeed, clinical use of aspirin to treat unexplained recurrent miscarriage in the absence of direct evidence of an effect speaks to a high tolerance for type I error. Power calculations for the EAGeR trial were conducted for a 10\% absolute difference in the probability of live birth on the basis of the standard thresholds (two-sided `\(\alpha = 0.05; \beta = 0.20\)`). However, powering the trial using a higher type I error rate could have resulted in a smaller sample size, and significant cost savings. Per NIH Reporter, a total of roughly \$10M was spent on EAGeR. Rough calculations suggest that reducing sample size by 200 would have resulted in a savings of about \$1.6M. --- # Context Considerations in Practice: Tofacitinib and AS .pull-left-a-lot[ .font130[ - Contrast EAGeR with a phase III trial of <span style="color: red;">tofacitinib</span> for ankylosing spondylitis (AS) <a name=cite-Deodhar2021></a>([Deodhar et al., 2021](#bib-Deodhar2021)). - AS: arthritic condition that results in inflammation/fusing of the spinal column. - Tofacitinib is an inhibitor of the Janus Kinase (JAK) pathway system, implicated in several autoimmune disorders closely related to AS (rheumatoid and psoriatic arthritis). ] ] .pull-right-a-little[ <img src="images/as.jpg" alt="" width="100%" style="display: block; margin: auto;" /> ] <div class="footnote"> <p><cite><a id='bib-Deodhar2021'></a><a href=>Deodhar, A. et al.</a> (2021).
In: <em>Annals of the Rheumatic Diseases</em> 80.8, pp. 1004–1013.</cite></p> image source: https://is.gd/fZGNjx </div> ??? In contrast to the EAGeR Trial, consider a phase III randomized double-blind placebo controlled trial of the effect of tofacitinib on ankylosing spondylitis (AS) among patients who have inadequately responded to or are intolerant of standard first line treatments (e.g., NSAIDs). Ankylosing spondylitis is an arthritic condition that results in inflammation and fusing of the spinal column, leading to potentially severe pain and immobility. As a therapeutic agent, tofacitinib is an inhibitor of the Janus Kinase (JAK) pathway system, which has been implicated in several autoimmune disorders closely related to AS, including rheumatoid and psoriatic arthritis.
--- # Context Considerations in Practice: Tofacitinib and AS .font130[ - However, JAK inhibitors are much newer (2011): Long-term risk profile is unknown - Additionally, known risk profile for JAK inhibitors is far more severe: - serious infections (pneumonia, nasopharyngitis, UTIs, cellulitis, herpes zoster) - cardiovascular disease - cancer - GI perforations - anemia - liver conditions - Power calculations for the trial were conducted for a 20% absolute difference in ASAS20 for a two-sided `\(\alpha\)` threshold of 0.05. - However, in this case, a much lower tolerance for type I error could have been warranted, given the unknown long-term effects and serious risk profile. ] ??? Unlike aspirin, JAK inhibitors represent a much newer class of drugs, first introduced in 2011 (making their long term risks much harder to ascertain). Additionally, the known risk profile for JAK inhibitors is far more severe, and includes serious infections, cardiovascular disease, cancer, gastrointestinal perforations, anemia, and liver conditions. Power calculations for the trial were conducted for a 20\% absolute difference in ASAS20 after 16 weeks of follow up, and suggested that 120 participants per treatment arm would yield a power of 90\% for a two-sided `\(\alpha\)` threshold of 0.05. However, in this case, a much lower tolerance for type I error would have been warranted, given the unknown long-term effects and serious risk profile. So here, in contrasting these two examples we see why it's important to recognize that, if we indeed elect to use them, error rates should be selected on the basis of key contextual details of the scientific question at hand. 
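The α trade-off in these power calculations can be made concrete with a standard normal-approximation sample-size formula. A rough Python sketch (the 10-point difference mirrors the EAGeR target, but the baseline rate of 0.55 is illustrative, not a design value from either trial):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p0, p1, alpha, beta):
    """Per-arm sample size for a two-sided normal-approximation test of a
    difference in proportions p1 - p0, at type I error alpha and type II
    error beta."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
    z_b = NormalDist().inv_cdf(1 - beta)        # quantile for target power
    var = p0 * (1 - p0) + p1 * (1 - p1)
    return ceil((z_a + z_b) ** 2 * var / (p1 - p0) ** 2)

# Illustrative: a 10-point absolute difference at 80% power under two
# different type I error tolerances.
for alpha in (0.05, 0.10):
    print(alpha, n_per_arm(0.55, 0.65, alpha, beta=0.20))
```

With these illustrative inputs, relaxing two-sided α from 0.05 to 0.10 drops the requirement from 373 to 294 per arm, on the order of the 200-participant reduction discussed for EAGeR.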
On the one hand, studying the effects of aspirin on pregnancy loss admits a much higher tolerance for type I error. On the other hand, one could frankly argue that a type I error rate of 5% is too high for a drug like tofacitinib. However, this is just the beginning of the issues here. *** In this example I do not mention that an altogether different approach should have been used here: Standard clinical practice was to first use NSAIDs, and if the patient did not respond, to transition to anti-TNF drugs, which have a similar risk profile to JAK inhibitors, but one that clinicians are more familiar with. A best practices approach here would have been to conduct a non-inferiority trial comparing tofacitinib to the anti-TNF regime. This would have helped with the potential unblinding issues I highlight below. --- # Context Considerations and Validity: Tofacitinib and AS .font140[ - Significance tests are not immune to validity threats (e.g., systematic biases) <a name=cite-Greenland2016></a>([Greenland et al., 2016](#bib-Greenland2016)), even if one is willing to adapt the threshold to a particular scientific context. These threats affect both randomized trials and observational studies. - Tofacitinib is known to result in short-term dose-dependent changes in lipid concentrations, liver enzymes, creatine kinase, and blood counts → <span style="color: red;">functional unblinding</span> <a name=cite-VanderHeijde2017></a>([van der Heijde et al., 2017](#bib-VanderHeijde2017)). → patients in the treatment group could have reported better outcomes due to an <span style="color: red;">expectancy effect</span> <a name=cite-Huneke2025></a>([Huneke et al., 2025](#bib-Huneke2025)). ] <div class="footnote"> <p><cite><a id='bib-Greenland2016'></a><a href=>Greenland, S. et al.</a> (2016). In: <em>European Journal of Epidemiology</em> 31.4, pp. 337–350.</cite></p> <p><cite><a id='bib-VanderHeijde2017'></a><a href=>van der Heijde, D. et al.</a> (2017).
In: <em>Annals of the Rheumatic Diseases</em> 76.8, pp. 1340–1347.</cite></p> <p><cite><a id='bib-Huneke2025'></a><a href=>Huneke, N. T. M. et al.</a> (2025). In: <em>JAMA Psychiatry</em> 82.5, pp. 531–538.</cite></p> </div> ??? Considering again the trial on tofacitinib, both patients and clinicians were blinded to treatment assignment. However, tofacitinib is known to result in short-term dose-dependent changes in lipid concentrations, liver enzymes, creatine kinase, and blood counts. Results of tests for these markers, as well as the side-effect profile, could easily lead to **functional unblinding** of patients in the trial. In fact, in an email conversation, the lead author of this trial, Atul Deodhar, confirmed to me that they believed unblinding was possible in the patients who received tofacitinib. Notably, because the primary outcome was a subjective measure of self-reported improvement (ASAS20), it is possible that unblinded patients in the treatment group reported better outcomes due to an expectancy effect, where patients expect to be better, and thus report an improvement. --- # Context Considerations and Validity: Tofacitinib and AS .font150[ - Expectancy effects are not "biases" <a name=cite-Mansournia2017a></a>([Mansournia et al., 2017](#bib-Mansournia2017a)), but would threaten the validity of a study on a drug like tofacitinib. - Indeed, if expectancy effects overwhelm the physiological effects of tofacitinib, stricter testing procedures will only lead to stronger evidence for the wrong hypothesis, an issue often referred to as Type III error ([Stark, 2022a](#bib-Stark2022a)). ] <div class="footnote"> <p><cite><a id='bib-Mansournia2017a'></a><a href=>Mansournia, M. A. et al.</a> (2017). In: <em>Epidemiology (Cambridge, Mass.)</em> 28.1, pp. 54–59.</cite></p> <p><cite><a id='bib-Stark2022a'></a><a href=>Stark, P. B.</a> (2022a). In: <em>Pure and Applied Geophysics</em> 179.11, pp. 4121–4145.</cite></p> </div> ??? 
Such expectancy effects, though not technically biases (Mansournia et al., 2017), would threaten the validity of a study on a drug like tofacitinib. If perceived improvements in AS result from an expectancy effect of being on the drug, and not the physiologic effects of the drug itself, patients and clinicians should wonder whether a drug with a risk profile like tofacitinib's is worth it. Indeed, if expectancy effects overwhelm the physiological effects of tofacitinib, stricter testing procedures will only lead to stronger evidence for the wrong hypothesis, otherwise known as a Type III error. --- # Summary: The Reflexive Challenge of Scientific Context .middle-content[ .font150[ 1) Context Determines Threshold Tolerance 2) Context Determines Validity ] ] ??? So, let's now take a moment to summarize the key points of what we just covered. First, if we elect to use them, significance thresholds should be chosen on the basis of the study's context. It doesn't make sense for us to use the same error tolerance threshold for a study like EAGeR versus a study of the effects of tofacitinib. Second, and far more importantly, whether we elect to use p-value thresholds or not, the key driving force behind any analysis is the set of assumptions that shapes whether we can interpret the results the way we'd like to, or the way we think we can. These assumptions are shaped entirely by foundational context. --- # Summary: The Reflexive Challenge of Scientific Context .middle-content[ .font150[ - Part of the scientific task: what goes in `\(M\)` and do these elements hold? - This is not always easy. - Knowing `\(M\)` requires knowing systems under study. - Knowing systems under study requires knowing `\(M\)`. ] ] ??? The implication here is that much of our attention, energy, and focus should be on identifying and understanding the elements of scientific context that matter for the study. This is not always an easy task. 
Some of the issues I'm discussing here echo a tension recognized since at least as far back as Plato's Socratic Dialogues, and captured in Meno's "paradox of inquiry." When using statistical methods, meaningful evaluation of assumptions requires prior knowledge of the system under study, yet that knowledge is precisely what scientific inquiry aims to produce. The scientist can thus be caught in a reflexive loop: the validity of their tools depends on knowledge they may still be in the process of acquiring. This can make it difficult even to identify the relevant elements of `\(M\)`, which can leave potential threats to a study's validity hiding in the messy aspects of study context. --- <img src="images/Benjamin_image.png" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; object-fit: contain; filter: invert(1);"> <div style="position: relative; z-index: 10;"> -- <img src="images/speed_limit.png" alt="" width="40%" style="display: block; margin: auto;" /> </div> ??? So, when cast in this light, I think it's easy to see why calls like the one Benjamin and many co-authors made in 2018 to lower significance thresholds are problematic. Though in mathematical and decontextualized settings, there may be some compelling reasons to do so (Benjamin et al. use Bayesian arguments to make their mathematically compelling case). In applied settings, context matters, a lot. The best analogy I can think of is that arguing for universal significance thresholds would be tantamount to making a case for a single nationwide speed limit for all roads across the country. It simply doesn't work in practice. --- # What Could Be Done? .font130[ - While not the "solution", tools and systems can help: - compatibility intervals - surprisal/s-values - understanding the role of cognitive biases and heuristic tools - cognitive forcing tools - CONSORT, PRECIS, STROBE, TARGET - These could be taught more prominently in MPH and PhD programs. ] ??? 
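To make the s-value bullet on this slide concrete, here is a minimal sketch: the s-value (Shannon surprisal) re-expresses a p-value as `\(s = -\log_2(p)\)` bits of information against the test hypothesis, i.e., the surprise of seeing `\(s\)` fair coin tosses all land heads.

```python
from math import log2

def s_value(p):
    """Shannon surprisal of a p-value: bits of information against the test hypothesis."""
    return -log2(p)

print(round(s_value(0.05), 2))   # 4.32 bits: about as surprising as 4-5 heads in a row
print(round(s_value(0.005), 2))  # 7.64 bits
print(s_value(0.5))              # 1.0 bit: one coin toss; essentially unsurprising
```

The appeal, relative to a binary significance verdict, is that the surprisal scale makes it harder to over-read a p-value just above or below an arbitrary threshold.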
So what can we do instead? Tools and methodologies are available to help us navigate and manage context: compatibility intervals, surprisal or s-values, and an understanding of cognitive biases and heuristics, as well as cognitive forcing tools, such as the STROBE, TARGET, or CONSORT guidelines, can go a long way in helping us navigate complex settings. And so they should take a more central role in our MPH and PhD curricula. However, we need to recognize these tools for what they are: aids and assists. They should be presented alongside a deeper engagement with the mess of context that defines an area of science: things like substantive theory, or public health and clinical practice. --- # The Role of Epidemiologists - There can be "no mechanical alternative" to informed judgement <a name=cite-Falk1995></a><a name=cite-Gigerenzer1993></a>([Falk et al., 1995](#bib-Falk1995); [Gigerenzer, 1993](#bib-Gigerenzer1993)). - We must become experts at identifying and unpacking the contextual features of a given scientific problem, and develop strategies to account for them in our study designs and analyses. - This can best be done by merging methodological expertise with substantive knowledge - Like John Snow, we must do careful detective work ??? More to the point, I believe that at the core of what I presented for you today is a key takeaway: quoting Ruma Falk and Charles Greenbaum, there can be no mechanical alternative to informed judgement. No single threshold, no universal rule, no algorithm or methodological tool can substitute for understanding the scientific context of the problem you are studying. This implies that we must become experts at identifying and unpacking the relevant contextual features of a scientific problem, and develop strategies to account for them. 
This includes understanding the treatment or exposure under study, its risk profile, the outcome being measured, the study design, and the potential threats to validity that arise from these details, as well as a whole host of other elements that sometimes go unmentioned or remain implicit in our application of scientific methods. In other words, we must do careful detective work that will sometimes go beyond the prescriptions of analytic and statistical methodology. This is fundamentally what I believe epidemiology is about: we are the original medical detectives. This is an aspect of our field and its history that should not go unappreciated. This, I believe, can go a long way in balancing the debate about statistics and evidence in medical research. --- # The arXiv Post <a href="https://arxiv.org/abs/2604.02526" target="_blank"> <img src="images/arxiv_paper.png" style="width: 100%; max-height: 480px; object-fit: cover; object-position: top; border: 1px solid #444;"> </a> --- # Mice and Tigers .font150[... 
to find out why I call it "Mice and Tigers"] <div style="display: flex; align-items: flex-start; gap: 200px; margin-top: 100px;"> <iframe src="https://miceandtigers.substack.com/embed" width="480" height="320" style="border: 1px solid #EEE; background: white; flex-shrink: 0;" frameborder="0" scrolling="no"></iframe> <div style="display: flex; align-items: center; justify-content: center; height: 320px;"> <img src="images/qr_substack.svg" style="width: 350px; height: 350px;"> </div> </div> --- class: title-slide, left, bottom <span style="font-size: 50px;"> Applied Statistics Requires Scientific Context </span> <br> <span style="font-size: 35px;"> The Role of Epidemiologists in Statistics Reform </span> <br> <br><br><br><br><br><br><br><br> <br><br> <table style="border: none; border-collapse: collapse; margin-left: 0; margin-top: -50px; float: left;"> <tr style="border: none; background: transparent; line-height: 0.8;"> <td style="border: none; border-right: 1px solid #6ca3d9; padding: 2px 5px; background: transparent;"><strong>Ashley I Naimi, PhD</strong></td> <td style="border: none; padding: 2px 5px; background: transparent;">Dept of Epidemiology</td> </tr> <tr style="border: none; background: transparent; line-height: 0.8;"> <td style="border: none; border-right: 1px solid #6ca3d9; padding: 2px 5px; background: transparent;">Professor</td> <td style="border: none; padding: 2px 5px; background: transparent;">Emory University</td> </tr> </table> <div style="line-height: 1.2;">
<a href="mailto:ashley.naimi@emory.edu">ashley.naimi@emory.edu</a> <br>
<a href="https://ainaimi.github.io/">https://ainaimi.github.io/</a> </div> <img src="images/qr_code.svg" class="qr-code" alt="QR Code">