class: title-slide, left, bottom <span style="font-size: 50px;"> Applied Statistics Requires Scientific Context </span> <br> <span style="font-size: 35px;"> Why Statistics Reform Needs Epidemiologic Thinking </span> <br> <span style="font-size: 35px;"> Statistics Reform or Science Reform? </span> <br> <br><br><br><br><br><br><br><br> <br><br> <table style="border: none; border-collapse: collapse; margin-left: 0; margin-top: -50px; float: left;"> <tr style="border: none; background: transparent; line-height: 0.8;"> <td style="border: none; border-right: 1px solid #6ca3d9; padding: 2px 5px; background: transparent;"><strong>Ashley I Naimi, PhD</strong></td> <td style="border: none; padding: 2px 5px; background: transparent;">Dept of Epidemiology</td> </tr> <tr style="border: none; background: transparent; line-height: 0.8;"> <td style="border: none; border-right: 1px solid #6ca3d9; padding: 2px 5px; background: transparent;">Professor</td> <td style="border: none; padding: 2px 5px; background: transparent;">Emory University</td> </tr> </table> <div style="line-height: 1.2;">
<a href="mailto:ashley.naimi@emory.edu">ashley.naimi@emory.edu</a> <br>
<a href="https://ainaimi.github.io/">https://ainaimi.github.io/</a> </div> <img src="images/qr_code.svg" class="qr-code" alt="QR Code"> --- # Acknowledgements .font150[ - This work benefitted from discussion with several colleagues and friends: - Enrique Schisterman, Sunni Mumford, Stephanie Hinkle, Alexander Levin, and Eric Tchetgen Tchetgen at UPenn - Larry Wasserman at CMU - Stephen R. Cole at UNC - Sander Greenland at UCLA - Please feel free to report errors and send comments to me: [ashley.naimi@emory.edu](mailto:ashley.naimi@emory.edu) ] --- # My Goals with This Talk .middle-content[ .font150[ 1) Distinguish two types of scientific "context" often invoked in statistics 2) Argue that significance thresholds are inherently flawed 3) Convince you that we (epidemiologists/biostatisticians) have an important role to play in statistics (science) reform ] ] --- # Overview .font120[ * Background: The Mis-Use of Statistics * Statistics and "Context" * P Values — Divergence and Decision * Significance Thresholds and Context Considerations in Practice - Low Dose Aspirin and Pregnancy Loss - Tofacitinib and Ankylosing Spondylitis * Context Considerations and Validity Threats * A Call for Reform <!-- - Applied statsitics needs epidemiologic thinking. start with the main point that Tukey makes in 1962: that "data science" should not be mathematical A goal of mathematical statisticians: develop and evaluate data analysis methods that are generally applicable, and thus decontextualized. This is important, and ... --> ] --- <br><br><br><br> .font150[ > ... there is no royal road to statistical induction. .right[— Jacob Cohen (1990) *Am Psychol* 45(12):1304-1312] ] --- # Background: The Mis-Use of Statistics Statistical methods play a central role in science. But, ... 
-- <img src="images/s1.png" class="random-img-1" style="width: 500px;"> <img src="images/s2.png" class="random-img-2" style="width: 500px;"> <img src="images/s3.png" class="random-img-3" style="width: 500px;"> <img src="images/s4.png" class="random-img-4" style="width: 500px;"> <img src="images/s5.png" class="random-img-5" style="width: 500px;"> <img src="images/s6.png" class="random-img-6" style="width: 500px;"> <img src="images/s7.png" class="random-img-7" style="width: 500px;"> <img src="images/s8.png" class="random-img-8" style="width: 500px;"> <img src="images/s9.png" class="random-img-9" style="width: 500px;"> <img src="images/s10.png" class="random-img-10" style="width: 500px;"> <img src="images/s11.png" class="random-img-11" style="width: 500px;"> <img src="images/s12.png" class="random-img-12" style="width: 500px;"> <img src="images/s13.png" class="random-img-13" style="width: 500px;"> <img src="images/s14.png" class="random-img-14" style="width: 500px;"> ??? Statistical methods play a central role in science. However, there has long been a vigorous discussion and debate on the relationships between statistical methods and the generation and validation of scientific knowledge. 
This discussion has focused on a number of topics, including: - how p values and other inferential measures are used and taught (Wasserstein and Lazar 2016, Rafi and Greenland 2020, Alawbathani et al 2021) - the interpretation of p values and related metrics such as s values, and compatibility metrics (Cole et al 2021, Greenland 2019a, Greenland 2022a, Greenland 2023, Greenland 2023a) - their relation to standards of evidence (Peskun 2020, van Zwet 2023, Gibson 2021) - the selection and utility of p value thresholds (McShane 2019, Benjamin 2018, Hemerik 2025, Maier 2022) - the impact of heuristic tools and cognitive biases in the interpretation of statistical/scientific results (Rafi 2020) - the role of alternative frameworks, such as Bayesian decision making and inferential systems or, more recently, an alternative to the p value referred to as the "e value" (Goodman 1995, Bickel 2022, Ramdas 2025) --- # Background: Applied Statistics and Scientific Context .font130[ - Major theme of this literature: "scientific context" - However, "scientific context" is an imprecise, undefined term. - This has led to important ambiguities, obfuscating precisely *how* one might act on various calls for statistics reform. - Scientific/clinical context is often used in one of two distinct ways: - Foundational: nebulous and elusive since this context will depend on the nature of the scientific question and the nuances of the area being studied, and will often be presented as "subjective." - Quantitative: more superficial, easier to understand, often implicates the outputs of statistical methods applied to data. ] ??? A major theme of this discussion has been the role that scientific context must play when using statistical methods. However, scientific context is an imprecise term with no generally accepted sharp definition. This has led to an important ambiguity in how this term is used in the literature, obfuscating precisely how one might act on various calls for statistics reform.
In relation to the use and application of statistical methods, scientific or clinical context is often used in one of two distinct ways. The first is foundational, but nebulous and elusive since this context will depend in unpredictable ways on the nature of the scientific question and the nuances of the area being studied, and will often be presented as subjective. The second is more superficial and easier to understand, and is usually quantitative in nature. It's often based on the outputs of statistical methods applied to data. --- # Background: Applied Statistics and <span style="color: red;">**Foundational**</span> Context .font150[ There is an extensive literature dating back to at least the early 20th century on the role that the more nebulous version of scientific context must play in the use of statistical methods. - **Jerzy Neyman**: on NHSTs, error thresholds should be selected "[a]ccording to the circumstances and ... subjective attitudes of the research worker." <a name=cite-Neyman1977></a>([Neyman, 1977](#bib-Neyman1977), p104) - **John Tukey**: results of a data analysis depend on *judgements* that arise in an attempt to mitigate the complexity of a particular experimental scenario. <a name=cite-Tukey1962></a>([Tukey, 1962](#bib-Tukey1962), p9, p46) ] <div class="footnote"> <p><cite><a id='bib-Neyman1977'></a><a href=>Neyman, J.</a> (1977). In: <em>Synthese</em> 36.1, pp. 97–131.</cite></p> <p><cite><a id='bib-Tukey1962'></a><a href=>Tukey, J. W.</a> (1962). In: <em>The Annals of Mathematical Statistics</em> 33.1, pp. 1–67.</cite></p> </div> ??? There is an extensive literature dating back to at least the early 20th century on the role that the more nebulous version of scientific context must play in the use of statistical methods. Several examples can be listed. Jerzy Neyman, in discussing the use of his significance testing framework, noted that error thresholds should be selected "[a]ccording to the circumstances and ...
subjective attitudes of the research worker." John Tukey, in what might be considered the founding manifesto of modern data science, famously argued that the results of a data analysis depend on judgements that arise in an attempt to mitigate the complexity of a particular experimental scenario. --- # Background: Applied Statistics and <span style="color: red;">**Foundational**</span> Context .font150[ - **David Freedman**: The role that "shoe leather"—careful detective work conducted by painstakingly walking the streets of London to collect facts and validate assumptions—played in John Snow's work <a name=cite-Freedman1991></a>([Freedman, 1991](#bib-Freedman1991), p299) - **Philip Stark**: "[w]hat claims and appears to be 'science' may be a mechanical amplification of the opinions and *ad hoc* choices" that arise as the result of and are shaped by the specific contextual details of a scientific question. <a name=cite-Stark2022a></a>([Stark, 2022a](#bib-Stark2022a)) ] <div class="footnote"> <p><cite><a id='bib-Freedman1991'></a><a href=>Freedman, D. A.</a> (1991). In: <em>Sociological Methodology</em> 21, pp. 291–313.</cite></p> <p><cite><a id='bib-Stark2022a'></a><a href=>Stark, P. B.</a> (2022a). In: <em>Pure and Applied Geophysics</em> 179.11, pp. 4121–4145.</cite></p> </div> ??? David Freedman documented the role that "shoe leather" played in leading John Snow to end the cholera outbreak he was studying in the 1850s. By "shoe leather", he meant the careful detective work conducted by painstakingly walking the streets of London to collect facts and validate assumptions. Freedman argued that Snow's "brilliant detective work on nonexperimental data'' was impressive not because of the statistical techniques used, but because of "the handling of the scientific issues." Finally, In discussing the importance of models, Philip Stark notes that "[w]hat claims and appears to be 'science' may be a mechanical amplification of the opinions and ad hoc choices ... 
which lacks any tested, empirical basis." These "opinions and ad hoc choices" arise as the result of and are shaped by the specific contextual details of a scientific question. Several other similar such illustrations can be found throughout the scientific literature. --- # Background: Applied Statistics and <span style="color: red;">**Quantitative**</span> Context .font150[ - **Rebecca Betensky**: The interpretation of p-values "must be considered in the context of certain features of the design and substantive application, such as sample size and meaningful effect size." <a name=cite-Betensky2019></a>([Betensky, 2019](#bib-Betensky2019), p115) - **Roychoudhury et al**: Propose dual-criterion phase II clinical trial design that incorporates statistical significance with "clinical relevance", defining the latter as a "sufficiently large effect estimate." <a name=cite-Roychoudhury2018></a>([Roychoudhury et al., 2018](#bib-Roychoudhury2018), p452) - **Minimal Clinically Important Differences**: A good entry article is <a name=cite-McGlothlin2014></a>([McGlothlin et al., 2014](#bib-McGlothlin2014)) ] <div class="footnote"> <p><cite><a id='bib-Betensky2019'></a><a href=>Betensky, R. A.</a> (2019). In: <em>The Am Stat</em> 73.1, pp. 115–117.</cite></p> <p><cite><a id='bib-Roychoudhury2018'></a><a href=>Roychoudhury, S. et al.</a> (2018). In: <em>Clinical Trials</em> 15.5, pp. 452–461.</cite></p> <p><cite><a id='bib-McGlothlin2014'></a><a href=>McGlothlin, A. E. et al.</a> (2014). In: <em>JAMA</em> 312.13, pp. 1342–1343.</cite></p> </div> ??? In contrast, there are many who define scientific or clinical context differently. For instance, in an article entitled "The P Value Requires Context, Not a Threshold," Rebecca Betensky argued that the interpretation of p-values "must be considered in the context of certain features of the design and substantive application, such as sample size and meaningful effect size." 
Roychoudhury et al proposed a dual-criterion phase II clinical trial design that incorporates statistical significance with "clinical relevance", defining the latter as no more than a "sufficiently large effect estimate." Of course, both of these are connected to a much broader literature on "minimal important differences," which reflect the smallest magnitude change in a study's outcome that would be considered "important" based on clinical or scientific context --- # Background: Applied Statistics and Scientific Context .font150[ - These two notions of scientific context are connected. - Features often described as "context" in the sharper sense will arise out of the more nebulously defined "context." - Foundational context is, consequently, more important - (But: I am not advocating for an abandonment of sharper contextual considerations) ] ??? Importantly, these two notions of scientific context--foundational and quantitative--are connected. Features of study design and effect magnitudes often ascribed as "context" in the sharper sense will often arise out of aspects of the more nebulously defined notion of context. Consequently, foundational context is more important, since quantitative context depends on it Note, however, that I am not advocating for an abandonment of sharper contextual considerations. I believe these are an essential part of good science. What I am arguing is that we must understand where these sharper contextual considerations come from. --- # The P Value .center[ .middle-content[ .font150[ The probability of observing a result as extreme or more extreme than the one we observed if the test hypothesis was true, <span style="color: red;">and the underlying statistical model is correct.</span> ] ] ] ??? Now, rather than repeating that scientific context is nebulously defined and elusive, I want to try to provide some more concrete insights. To do that, I will start by presenting a commonly used statistical tool, the p value. 
Furthermore, let's start by defining the p value in a format that is considered technically correct, but that has important limitations. Up until very recently, this definition of the p value is what I would teach in my courses, and it's what I was taught here at UNC as a doctoral student a couple of years ago. In this definition, we say that "the p value is the probability of observing a result as extreme or more extreme than the one we observed if the test hypothesis was true." I often make it a point to add and emphasize the additional consideration that the "underlying statistical model is correct" and that this so-called "underlying statistical model" is not referring to just the, e.g., logistic regression model we are using to conduct our analysis. Inevitably, students ask me what else there is, and I present a hodge-podge list of items that depends on a lot of things but that can involve, possibly, the assumption of independent and identically distributed observations, no measurement error, no confounding, as well as many other items. Though correct, one reason why I think this definition of the p value is insufficient is that it is all too easy to either ignore the condition that "the underlying statistical model is correct", or to simply mention it without actually understanding how important this condition is. --- # The P Value: Divergence Interpretation .font120[ - The `\(p\)`-value: a quantile (or percentile) location measure of divergence between the actual data `\(z\)`, and what we'd expect the data to look like <span style="color: red;">under certain conditions/assumptions</span> <a name=cite-Greenland2023a></a><a name=cite-Perezgonzalez2015></a>([Greenland, 2023b](#bib-Greenland2023a); [Perezgonzalez, 2015](#bib-Perezgonzalez2015)).
- Let's call these <span style="color: red;">conditions/assumptions</span> `\(M\)` - Suppose interest lies in the ITT effect in a double-blind placebo controlled trial: `$$\psi_{ITT} = E \left (Y^{a = 1} \right ) - E \left (Y^{a = 0} \right )$$` - `\(Y^a\)`: outcome that would be observed if an individual was randomized to treatment arm `\(A = a\)` ] <div class="footnote"> <p><cite><a id='bib-Greenland2023a'></a><a href=>Greenland, S.</a> (2023b). In: <em>Scandinavian Journal of Statistics</em> 50.1, pp. 54–88.</cite></p> <p><cite><a id='bib-Perezgonzalez2015'></a><a href=>Perezgonzalez, J. D.</a> (2015). In: <em>Frontiers in Psychology</em> 6.</cite></p> </div> ??? Following prior work (Greenland 2023; Perezgonzalez 2015), the P value can be defined as a quantile or percentile location measure of divergence, distance, or separation between the data we collected, let's call this `\(z\)`, and what we'd expect these data to look like if a set of conditions or assumptions were true. For convenience, let's call the set of conditions and assumptions `\(M\)`. Let's say we're interested in estimating the intention to treat effect in a double blind placebo controlled trial, which we can define using potential outcomes. *** More technically, `\(M\)` is a model manifold that lies in the `\(n\)`-dimensional expectation space defined by the data generating mechanism that produced `\(z\)`. Importantly, this manifold `\(M\)` is a subset of the `\(Z\)`-space into which the conjunction of the model constraints (assumptions) and test hypothesis force the data expectation or predict where `\(z\)` would be were there no random variability. This manifold is the product of the *logical conjunction* of the data and elements in `\(M\)`. I note here (and discuss later) that these assumptions and conditions depend heavily on the target parameter of interest and the study context and design.
They are often broad, sometimes implicit, and mostly hard to verify or guarantee in practice. --- # The P Value: Divergence Interpretation .font150[ `$$\psi = E \left (Y^{a = 1} \right ) - E \left (Y^{a = 0} \right )$$` - This ITT effect can be estimated with our data under conditions/assumptions that constitute elements of `\(M\)`, such as: - (`\(M_1\)`) randomization "worked" - (`\(M_2\)`) blinding was maintained - (`\(M_3\)`) any loss to follow-up is MCAR ] ??? This ITT effect can be estimated with our data (or, is identified) under conditions or assumptions that constitute elements of M, which include things like the assumption that randomization worked (i.e., that the distribution of all covariates across the treatment groups is "balanced"), that blinding was maintained or that any unblinding is inconsequential, that any loss to follow-up in the trial can be classified as, for example, missing completely at random. Under these assumptions, and the test hypothesis that we will choose, we can construct a p value for the ITT effect in this trial. --- # The P Value: Divergence Interpretation .font150[ - Let's construct a `\(p\)`-value for this trial - We first add to our assumptions `\(M\)` our test hypothesis (the null), so that: - `\(M = \{ M_0: \psi = 0, M_1, M_2, M_3\}\)` - We then take the sample mean of the outcome in each treatment arm: `\(z = (\bar Y_1,\bar Y_0)\)` - This new `\(M\)` implies: `\(\hat{\psi} = \bar Y_1 - \bar Y_0\)` should be zero ] ??? We first add to our set of assumptions the chosen test hypothesis value, which in our case will be the null. This null value, when combined with the other conditions/assumptions in M, implies that the difference in estimated treatment specific means should be zero. Importantly, it's not just `\(M_0\)` that implies that our sample means should be equivalent. It is the entire set of assumptions in M that implies the difference in sample means will be zero.
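The construction above can be sketched in a few lines of code. This is a minimal sketch, assuming a hypothetical binary-outcome trial and a simple normal approximation; the function name, arm sizes, and the 55% event probability are illustrative and not taken from any trial discussed in this talk. The key point the code makes explicit: the returned p value measures divergence between `z` and the *entire* set `M`, not the null alone.

```python
import math
import random

def itt_p_value(y1, y0):
    """Two-sided p-value for the difference in sample means (the ITT effect).

    This p-value is a divergence measure between the observed data z and the
    WHOLE set M = {M0: psi = 0, M1: randomization worked, M2: blinding held,
    M3: loss to follow-up is MCAR}; a small p indicts the conjunction, not M0
    in isolation."""
    n1, n0 = len(y1), len(y0)
    ybar1, ybar0 = sum(y1) / n1, sum(y0) / n0
    psi_hat = ybar1 - ybar0  # observed divergence from what M predicts (zero)
    # standard error of a difference in proportions (binary outcome)
    se = math.sqrt(ybar1 * (1 - ybar1) / n1 + ybar0 * (1 - ybar0) / n0)
    z_stat = psi_hat / se
    # quantile location of the divergence under M: tail area beyond |z|,
    # i.e. p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p = math.erfc(abs(z_stat) / math.sqrt(2))
    return psi_hat, p

# hypothetical trial in which all of M (including the null M0) holds
random.seed(1)
arm1 = [1 if random.random() < 0.55 else 0 for _ in range(500)]
arm0 = [1 if random.random() < 0.55 else 0 for _ in range(500)]
psi_hat, p = itt_p_value(arm1, arm0)
```

If any single element of M fails (say, loss to follow-up is differential rather than MCAR), the same arithmetic produces a small p even when psi is truly zero, which is exactly the interpretive trap the divergence framing guards against.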
--- # The P Value: Divergence Interpretation <img src="images/p_value_combined.png" width="100%" style="display: block; margin: auto;" /> ??? FOR THE LEFT: - POINT OUT THE X AND Y AXIS - POINT OUT THE DIAGONAL - EXPLAIN THAT Z REPRESENTS THE SAMPLE ESTIMATE - EXPLAIN THAT D REPRESENTS HOW "FAR" THE SAMPLE ESTIMATE IS FROM THE DIAGONAL FOR THE RIGHT: - WE CAN CONVERT D INTO A DISTRIBUTION THAT TELLS US THE EXPECTED FREQUENCIES OF DIFFERENT D'S UNDER THE SET OF CONDITIONS M - THE AREA UNDER THIS DISTRIBUTION CURVE CAN THEN BE USED TO MEASURE HOW "COMPATIBLE" OUR DATA ARE WITH ALL THE CONDITIONS IN M --- # The P Value: Divergence Interpretation <img src="images/pval_geometry.gif" width="100%" style="display: block; margin: auto;" /> ??? This figure shows that when p is 1, the data we observed are perfectly compatible with the assumptions in M, including our test hypothesis. --- # The P Value: Divergence Interpretation .font150[ - `\(p\)` measures the "consonance/divergence" between `\(z\)` and `\(M\)`. - Larger "divergences" yield smaller `\(p\)` → incompatibility between the <span style="color: red;">set of conditions (all of them) in `\(M\)`</span> and `\(z\)`. ] ??? What's useful about this definition of the p value is that it forces us to interpret the result we get as a measure of compatibility between the data AND the set of assumptions that must arise out of the scientific context of the study we are doing. This is in contrast to the alternative definition which, though technically correct, facilitates relegating these critical assumptions to the background. --- # The P Value: Decision Interpretation .font130[ - `\(p\)`-value: output of a decision criterion about whether or not to reject the model `\(M\)` as a plausible model for the data generating mechanism. - Conventional approach: `\(\alpha\)` (Type I error), often 0.05; `\(\beta\)` (Type II error), often 0.2. - Two Problems: - 1) Error rates were intended to be chosen "[a]ccording to the circumstances and ...
subjective attitudes of the research worker" ([Neyman, 1977](#bib-Neyman1977), p104). But standard thresholds are used even when these numerical values conflict with scientific aspects of the question under study. - 2) When `\(p < \alpha\)`, researchers are often compelled to reject the test hypothesis in isolation. But interpretation is only valid if one accepts as true all other elements of the model `\(M\)`. ] <div class="footnote"> <p><cite><a id='bib-Neyman1977'></a><a href=>Neyman, J.</a> (1977). In: <em>Synthese</em> 36.1, pp. 97–131.</cite></p> </div> ??? A second framework we can use to interpret the p value is that it is the output of a decision criterion about whether or not to reject the set `\(M\)` as a plausible model that generated the data. If we follow convention, we'd elect to reject `\(M\)` if the divergence p value was less than our alpha threshold of 0.05. But there are two problems with this approach: First, error rates were intended to be selected "[a]ccording to the circumstances and ... subjective attitudes of the research worker." Unfortunately, standard thresholds are used even when these numerical values conflict with the scientific aspects of the study we want to conduct. Second, researchers are often compelled to reject the test hypothesis in isolation. But this interpretation is only valid if one accepts as true all of the other elements of the set of conditions/assumptions we denoted `\(M\)`. In the remainder of this talk, I am going to provide you with some concrete examples demonstrating why the use of p value thresholds is problematic, and wrap up with a discussion of what we can do. --- # Context Considerations in Practice: The EAGeR Trial .font150[ - Low Dose Aspirin and Pregnancy Loss: EAGeR <a name=cite-Schisterman2014></a>([Schisterman et al., 2014](#bib-Schisterman2014)).
- Motivation: unexplained recurrent miscarriage may be attributable to underlying inflammation <a name=cite-Silver2007></a>([Silver et al., 2007](#bib-Silver2007)). - Aspirin: anti-inflammatory, in use for over a century, affordable, well-known low risk profile. - LDA in use for nearly a decade to treat unexplained recurrent miscarriage, even though evidence of this effect was lacking. ] <div class="footnote"> <p><cite><a id='bib-Schisterman2014'></a><a href=>Schisterman, E. F. et al.</a> (2014). In: <em>Lancet</em> 384.9937, pp. 29–36.</cite></p> <p><cite><a id='bib-Silver2007'></a><a href=>Silver, R. M. et al.</a> (2007). In: <em>Clinical Obstetrics</em>. John Wiley & Sons, Ltd. Chap. 11, pp. 141–160.</cite></p> </div> ??? Let's start our practical discussion with the Effects of Aspirin on Gestation and Reproduction (EAGeR) Trial, a multicenter double-blind placebo controlled trial of the effect of daily low-dose (81 mg) aspirin on live birth outcomes among women who were trying to conceive, but who had experienced one or two prior pregnancy losses. The trial was motivated by the fact that unexplained recurrent miscarriage may be attributable to underlying inflammation. Aspirin is an anti-inflammatory drug that has been in use for over a century in the general population, it is affordable, and its side-effects are well understood and relatively low-risk. And critically, at the time when the trial was conducted, low dose aspirin had been used for nearly a decade in clinical settings to treat unexplained recurrent miscarriage, even though evidence of its effect was lacking. The EAGeR Trial was conducted to fill this evidence gap. --- # Context Considerations in Practice: The EAGeR Trial .font150[ - Aspirin use as standard of practice (SOP) establishes a high tolerance for type I error. - Power calculations for the EAGeR trial were conducted for a 10% absolute difference in live birth (two-sided `\(\alpha = 0.05; \beta = 0.20\)`) ([Schisterman et al., 2014](#bib-Schisterman2014)).
- However, powering the trial using a higher `\(\alpha\)` threshold could have accomplished the same objectives. - Per NIH Reporter, a total of roughly $10M was spent on EAGeR. Per participant, this is about $8.1K. Reducing sample size by 200 would have resulted in an estimated savings of roughly $1.6M. ] <div class="footnote"> <p><cite><a id='bib-Schisterman2014'></a><a href=>Schisterman, E. F. et al.</a> (2014). In: <em>Lancet</em> 384.9937, pp. 29–36.</cite></p> </div> ??? This scientific context matters for study design and analysis. Indeed, clinical use of aspirin to treat unexplained recurrent miscarriage in the absence of direct evidence of an effect speaks to a high tolerance for type I error. Power calculations for the EAGeR trial were conducted for a 10\% absolute difference in the probability of live birth on the basis of the standard thresholds (two-sided `\(\alpha = 0.05; \beta = 0.20\)`). However, powering the trial using a higher type I error rate could have resulted in a smaller sample size, and significant cost savings. Per NIH Reporter, a total of roughly \$10M was spent on EAGeR. Rough calculations suggest that reducing sample size by 200 would have resulted in a savings of about \$1.6M. --- # Context Considerations in Practice: Tofacitinib and AS .pull-left-a-lot[ .font130[ - Contrast EAGeR with a phase III trial of <span style="color: red;">tofacitinib</span> for ankylosing spondylitis (AS) <a name=cite-Deodhar2021></a>([Deodhar et al., 2021](#bib-Deodhar2021)). - AS: arthritic condition that results in inflammation/fusing of the spinal column. - Tofacitinib is an inhibitor of the Janus Kinase (JAK) pathway system, implicated in several autoimmune disorders closely related to AS (rheumatoid and psoriatic arthritis). ] ] .pull-right-a-little[ <img src="images/as.jpg" width="100%" style="display: block; margin: auto;" /> ] <div class="footnote"> <p><cite><a id='bib-Deodhar2021'></a><a href=>Deodhar, A. et al.</a> (2021).
In: <em>Annals of the Rheumatic Diseases</em> 80.8, pp. 1004–1013.</cite></p> image source: https://is.gd/fZGNjx </div> ??? In contrast to the EAGeR Trial, consider a phase III randomized double-blind placebo controlled trial of the effect of tofacitinib on ankylosing spondylitis (AS) among patients who have inadequately responded to or are intolerant of standard first line treatments (e.g., NSAIDs) (Deodhar et al 2021). Ankylosing spondylitis is an arthritic condition that results in inflammation and fusing of the spinal column, leading to potentially severe pain and immobility. As a therapeutic agent, tofacitinib is an inhibitor of the Janus Kinase (JAK) pathway system, which has been implicated in several autoimmune disorders closely related to AS, including rheumatoid and psoriatic arthritis. --- # Context Considerations in Practice: Tofacitinib and AS .font150[ - Patients randomized 1:1 to receive 5mg tofacitinib or placebo twice daily for 16 weeks of follow-up. - Primary endpoint: Assessment of SpondyloArthritis international Society ≥20% improvement (ASAS20) score (self reported change in condition score). ] <div class="footnote"> <p><cite><a id='bib-Deodhar2021'></a><a href=>Deodhar, A. et al.</a> (2021). In: <em>Annals of the Rheumatic Diseases</em> 80.8, pp. 1004–1013.</cite></p> </div> ??? Patients in the trial were randomized 1:1 to receive 5mg tofacitinib or placebo twice daily for 16 weeks of follow-up. The primary study endpoint was a self reported change score, referred to as the ASAS20 score.
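The power calculations mentioned for both trials can be sketched with the standard normal-approximation sample-size formula for a difference in two proportions. This is a minimal sketch: only the 10-point difference and the (α = 0.05, β = 0.20) thresholds come from the EAGeR design discussed earlier; the 0.55/0.65 baseline and treated probabilities, and the relaxed α of 0.15, are illustrative assumptions, not trial values.

```python
from statistics import NormalDist
import math

def n_per_arm(p0, p1, alpha, beta):
    """Per-arm sample size to detect an absolute difference p1 - p0 in a
    two-arm trial with a binary outcome, two-sided alpha, power 1 - beta
    (normal approximation, unpooled variance)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_b = NormalDist().inv_cdf(1 - beta)       # critical value for power
    var = p1 * (1 - p1) + p0 * (1 - p0)
    return math.ceil((z_a + z_b) ** 2 * var / (p1 - p0) ** 2)

# hypothetical baseline probability 0.55, targeting a 10-point difference
conventional = n_per_arm(0.55, 0.65, alpha=0.05, beta=0.20)
relaxed = n_per_arm(0.55, 0.65, alpha=0.15, beta=0.20)  # higher type I tolerance
# relaxed < conventional: raising alpha shrinks the required sample size,
# which is the mechanism behind the EAGeR cost-savings argument; lowering
# alpha (as the tofacitinib risk profile might warrant) does the opposite.
```

Under these assumed probabilities, relaxing α cuts the per-arm requirement by roughly a third, on the order of the 200-participant reduction discussed for EAGeR; the same arithmetic run in reverse shows what a stricter α would cost the tofacitinib trial.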
--- # Context Considerations in Practice: Tofacitinib and AS .font130[ - However, JAK inhibitors are much newer (2011): Long-term risk profile is unknown - Additionally, known risk profile for JAK inhibitors is far more severe: - serious infections (pneumonia, nasopharyngitis, UTIs, cellulitis, herpes zoster) - cardiovascular disease - cancer - GI perforations - anemia - liver conditions - Power calculations for the trial were conducted for a 20% absolute difference in ASAS20 for a two-sided `\(\alpha\)` threshold of 0.05. - However, in this case, a much lower tolerance for type I error could have been warranted, given the unknown long-term effects and serious risk profile. ] ??? Unlike aspirin, JAK inhibitors represent a much newer class of drugs, first introduced in 2011 (making their long term risks much harder to ascertain). Additionally, the known risk profile for JAK inhibitors is far more severe, and includes serious infections, cardiovascular disease, cancer, gastrointestinal perforations, anemia, and liver conditions. Power calculations for the trial were conducted for a 20\% absolute difference in ASAS20 after 16 weeks of follow up, and suggested that 120 participants per treatment arm would yield a power of 90\% for a two-sided `\(\alpha\)` threshold of 0.05. However, in this case, a much lower tolerance for type I error would have been warranted, given the unknown long-term effects and serious risk profile. So here, in contrasting these two examples we see why it's important to recognize that, if we indeed elect to use them, error rates should be selected on the basis of key contextual details of the scientific question at hand. 
On the one hand, studying the effects of aspirin on pregnancy loss allows for a much higher tolerance for type I error. On the other hand, one could frankly argue that a type I error rate of 5% is too high for a drug like tofacitinib. However, this is just the beginning of the issues here. *** In this example I do not mention that an altogether different approach should have been used here: Standard clinical practice was to first use NSAIDs, and if the patient did not respond, to transition to anti-TNF drugs, which have a similar risk profile to JAK inhibitors, but one that clinicians are more familiar with. A best practices approach here would have been to conduct a non-inferiority trial to compare tofacitinib to the anti-TNF regime. This would have helped with the potential unblinding issues I highlight below. --- # Context Considerations and Validity: Tofacitinib and AS .font140[ - Significance tests are not immune to validity threats (e.g., systematic biases) <a name=cite-Greenland2016></a>([Greenland et al., 2016](#bib-Greenland2016)), even if one is willing to adapt the threshold to a particular scientific context. These threats affect both randomized trials and observational studies. - Tofacitinib is known to result in short-term dose-dependent changes in lipid concentrations, liver enzymes, creatine kinase, and blood counts → <span style="color: red;">functional unblinding</span> <a name=cite-VanderHeijde2017></a>([van der Heijde et al., 2017](#bib-VanderHeijde2017)). → patients in the treatment group could have reported better outcomes due to an <span style="color: red;">expectancy effect</span> <a name=cite-Huneke2025></a>([Huneke et al., 2025](#bib-Huneke2025)). ] <div class="footnote"> <p><cite><a id='bib-Greenland2016'></a><a href=>Greenland, S. et al.</a> (2016). In: <em>European Journal of Epidemiology</em> 31.4, pp. 337–350.</cite></p> <p><cite><a id='bib-VanderHeijde2017'></a><a href=>der Heijde, D. van et al.</a> (2017).
In: <em>Annals of the Rheumatic Diseases</em> 76.8, pp. 1340–1347.</cite></p> <p><cite><a id='bib-Huneke2025'></a><a href=>Huneke, N. T. M. et al.</a> (2025). In: <em>JAMA Psychiatry</em> 82.5, pp. 531–538.</cite></p> </div> ??? Considering again the trial on tofacitinib, both patients and clinicians were blinded to treatment assignment. However, tofacitinib is known to result in short-term dose-dependent changes in lipid concentrations, liver enzymes, creatine kinase, and blood counts. Results of tests for these markers would lead to **functional unblinding** of clinicians and (if reported to them) patients. Indeed, a phase II trial of tofacitinib for AS noted dose-dependent changes in laboratory outcomes, but did not report how many patients experienced these changes, nor whether these changes were reported to the participants. Notably, because the primary outcome was a subjective measure of self-reported improvement (ASAS20), it is possible that, for example, unblinded patients in the treatment group reported better outcomes due to an expectancy effect, where patients expect to get better, and thus report an improvement. --- # Context Considerations and Validity: Tofacitinib and AS .font150[ - Expectancy effects are not "biases" <a name=cite-Mansournia2017a></a>([Mansournia et al., 2017](#bib-Mansournia2017a)), but would threaten the validity of a study on a drug like tofacitinib. - Indeed, if expectancy effects overwhelm the physiological effects of tofacitinib, stricter testing procedures will only lead to stronger evidence for the wrong hypothesis, an issue often referred to as Type III error ([Stark, 2022a](#bib-Stark2022a)). ] <div class="footnote"> <p><cite><a id='bib-Mansournia2017a'></a><a href=>Mansournia, M. A. et al.</a> (2017). In: <em>Epidemiology (Cambridge, Mass.)</em> 28.1, pp. 54–59.</cite></p> <p><cite><a id='bib-Stark2022a'></a><a href=>Stark, P. B.</a> (2022a). In: <em>Pure and Applied Geophysics</em> 179.11, pp. 
4121–4145.</cite></p> </div> ??? Such expectancy effects, though not technically biases (Mansournia et al., 2017), would threaten the validity of a study on a drug like tofacitinib. If perceived improvements in AS result from an expectancy effect of being on the drug, and not the physiologic effects of the drug itself, patients and clinicians should wonder whether a drug with a risk profile like tofacitinib's is worth it. Indeed, if expectancy effects overwhelm the physiological effects of tofacitinib, stricter testing procedures will only lead to stronger evidence for the wrong hypothesis, otherwise known as a Type III error. --- <img src="images/s13a.png" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; object-fit: contain; filter: invert(1);"> <div style="position: relative; z-index: 10;"> -- <img src="images/speed_limit.png" width="40%" style="display: block; margin: auto;" /> </div> ??? So, when cast in the light of these contextual details, it's easy to see why calls like the 2018 one from Benjamin and many co-authors to lower the threshold of statistical significance from 0.05 to 0.005 are problematic. Though in mathematical and decontextualized settings there may be some compelling reasons to do so (Benjamin et al. use Bayesian arguments to make their mathematically compelling case), in applied settings, context matters, a lot. It's one reason why we've never heard arguments for a single nationwide speed limit consisting of a weighted average of all speed limits across the country. It simply doesn't work. --- # A Call for Reform: What Should Be Done? .font130[ - There can be "no mechanical alternative" to informed judgement <a name=cite-Falk1995></a><a name=cite-Gigerenzer1993></a>([Falk et al., 1995](#bib-Falk1995); [Gigerenzer, 1993](#bib-Gigerenzer1993)). 
- We must become experts at identifying and unpacking the contextual features of a given scientific problem, and develop strategies to account for them in our study designs and analyses. - This can best be done by merging expert methodology with substantive knowledge. - While not the "solution", tools are available: - compatibility intervals - surprisal/s-values - understanding the role of cognitive biases and heuristic tools - These should be taught more prominently in MPH and PhD programs. ] ??? I believe that at the core of what I presented for you today is a key takeaway: Quoting Ruma Falk and Charles Greenbaum, there can be no mechanical alternative to informed judgement. No single threshold, no universal rule, no algorithm can substitute for understanding the scientific context of the problem you are studying. This means we must become experts at identifying and unpacking the relevant contextual features of a scientific problem, and develop strategies to account for them. This includes understanding the treatment or exposure under study, its risk profile, the outcome being measured, the study design, the potential threats to validity that arise from these details, and a whole host of other elements that often go unmentioned or are implicit in our application of scientific methods. Tools and methodologies are available to help us navigate and manage context: compatibility intervals, surprisal or s-values, and an understanding of cognitive biases and heuristic tools can go a long way in helping us navigate complex settings. And so they should take a more central role in our MPH and PhD curricula. However, for our teaching to be most effective, these tools should be presented alongside a deeper engagement with the scientific reasoning that gives them meaning. --- # A Call for Reform: What Can We Do? .font110[ - Epidemiology and "shoe leather" ([Freedman, 1991](#bib-Freedman1991)). 
- Snow's work was impressive "because of the handling of the scientific issues," not the statistical techniques used. - This is what epidemiology is: the application of informed, context-specific judgement to questions about medical and population health. - Epidemiologists are in a strategically advantageous position to lead statistics reform: the discipline is built around recognizing and managing context and its complexity. ] <div class="footnote"> <p><cite><a id='bib-Freedman1991'></a><a href=>Freedman, D. A.</a> (1991). In: <em>Sociological Methodology</em> 21, pp. 291–313.</cite></p> </div> ??? So why are epidemiologists well positioned to lead this kind of reform? At the heart of what I'm discussing is the need to recognize the importance of informed, context-specific judgement to questions about medical and population health. David Freedman's famous "shoe leather" paper is a great illustration of this orientation, and of the utility of "painstaking detective work". This is precisely what we as epidemiologists are trained to do. Our discipline is built around identifying threats to validity, assessing study designs, evaluating measurement strategies, and reasoning carefully about causal structures. I think we need to recognize that an important part of our discipline's history has been thinking about confounding, selection bias, measurement error, and data quality, not JUST in abstract mathematical terms, but as messy, nuanced features of scientific contexts. This puts us in a strategically advantageous position to speak to the importance of statistics reform. Much of the statistics reform literature has focused on mathematics and abstract methodology, but there is a deeper need for the kind of scientific reasoning that epidemiologic training provides. The reform we need is not merely statistical. It is scientific. And epidemiologists, by training and by tradition, are well equipped to lead it. 
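---

# A Call for Reform: Two Tools, Sketched

.font110[

To make two of the quantities in this talk concrete, here is a minimal sketch: the normal-approximation sample size per arm behind a two-proportion power calculation, and the surprisal (s-value) transform of a p-value. The assumed ASAS20 response proportions (0.40 vs 0.60) are hypothetical stand-ins for a 20% absolute difference; the tofacitinib trial's actual design assumptions may have differed.

```python
from math import ceil, log2
from statistics import NormalDist  # stdlib normal quantiles (Python >= 3.8)

def n_per_arm(p0, p1, alpha=0.05, power=0.90):
    """Normal-approximation sample size per arm for a two-sided
    two-proportion test (no continuity correction)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    variance = p0 * (1 - p0) + p1 * (1 - p1)
    return ceil((z_a + z_b) ** 2 * variance / (p1 - p0) ** 2)

def s_value(p):
    """Shannon surprisal of a p-value: bits of information
    against the test hypothesis."""
    return -log2(p)

# Hypothetical proportions: 0.40 vs 0.60 is one way to realize
# a 20% absolute difference in ASAS20 response.
print(n_per_arm(0.40, 0.60))     # 127 per arm under this approximation
print(round(s_value(0.05), 2))   # 4.32 bits
print(round(s_value(0.005), 2))  # 7.64 bits
```

One way to read the s-values: p = 0.05 carries about 4.3 bits of information against the test hypothesis (roughly as surprising as four heads in a row), while p = 0.005 carries about 7.6, a difference no fixed threshold can translate into context-appropriate evidence.

]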
--- class: title-slide, left, bottom <span style="font-size: 50px;"> Applied Statistics Requires Scientific Context </span> <br> <span style="font-size: 35px;"> Why Statistics Reform Needs Epidemiologic Thinking </span> <br> <span style="font-size: 35px;"> Statistics Reform or Science Reform? </span> <br> <br><br><br><br><br><br><br><br> <br><br> <table style="border: none; border-collapse: collapse; margin-left: 0; margin-top: -50px; float: left;"> <tr style="border: none; background: transparent; line-height: 0.8;"> <td style="border: none; border-right: 1px solid #6ca3d9; padding: 2px 5px; background: transparent;"><strong>Ashley I Naimi, PhD</strong></td> <td style="border: none; padding: 2px 5px; background: transparent;">Dept of Epidemiology</td> </tr> <tr style="border: none; background: transparent; line-height: 0.8;"> <td style="border: none; border-right: 1px solid #6ca3d9; padding: 2px 5px; background: transparent;">Professor</td> <td style="border: none; padding: 2px 5px; background: transparent;">Emory University</td> </tr> </table> <div style="line-height: 1.2;">
<a href="mailto:ashley.naimi@emory.edu">ashley.naimi@emory.edu</a> <br>
<a href="https://ainaimi.github.io/">https://ainaimi.github.io/</a> </div> <img src="images/qr_code.svg" class="qr-code" alt="QR Code">