In my previous post, I proposed a “starter kit” for developing a statistical argument. An essential ingredient of that kit was the identification of the context of inquiry as confirmatory, exploratory, or something in between. In this post, I propose several criteria for defining a confirmatory context, and I suggest principles for using the language of “statistical significance” when a confirmatory context does not obtain.
Recently, I worked on an exposure-response analysis of adverse events for a biologic agent under development for various hematological cancers. Prior dose-based analyses of these data had identified six specific adverse events of concern, but had not addressed a number of higher-resolution concerns, including the merits of individualized dosing based on patient baseline characteristics. Given those higher-resolution questions, an objective of our exposure-response analysis was to determine whether variation in adverse event incidence could be substantially explained by pharmacokinetic exposure metrics, by immunological response to the therapeutic agent (patients varied in their propensity to develop antibodies that rendered the agent pharmacologically inert), and by other baseline factors such as liver function (some adverse events were characteristic of impaired liver function).
The context of inquiry that I have described in the preceding paragraph was, in at least one (loose) sense, a “confirmatory” context. By this I mean that the analysis was motivated by questions that were, to a great degree, pre-specified. Specifically, it was expected that mechanistically plausible predictor variables would at least partially explain variation in the pre-specified response variables. This was not a context that involved any substantial variable selection, model selection, “fishing expeditions”, or “p-hacking” that would recognizably invalidate a confirmatory framework.
However, to obtain a fully confirmatory context, it is not enough that the questions be specified in advance. Rules for adjudicating the evidence that will answer those questions must also be pre-specified. In the absence of formalized decision rules, it is impossible to define or (a fortiori) control meaningful error rates. And in the absence of formal control of meaningful error rates, the notion of “statistical significance” loses its confirmatory force, not least because of the multiple comparisons problem. Whereas failure to pre-specify the questions is analogous to drawing the dartboard on the wall after throwing the dart, the closely related failure to account for multiple comparisons is analogous to drawing many dartboards on the wall before throwing the dart, without specifying which target one is aiming for (in this application, we had at least one “dartboard” for each adverse event of interest). The connection between these two problems is identified in the ASA statement:
One need not formally carry out multiple statistical tests for [the problem of multiplicity] to arise: Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis.
(And, when the basis of such decisions is not pre-specified, it is easy to fool oneself -- or others, if one is nefarious -- as to their true basis.)
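To see why multiplicity alone can erode the evidentiary value of any single “significant” result, consider a minimal sketch (the numbers are illustrative, not drawn from the analysis): with six independent tests, say one per adverse event of interest, each run at a nominal alpha of 0.05, the chance of at least one spurious “significant” finding is already about one in four.

```python
# Illustrative sketch (hypothetical numbers): family-wise error rate under the
# simplifying assumption of six independent tests, each at a nominal alpha of 0.05.
alpha = 0.05
n_tests = 6  # e.g., one test per pre-specified adverse event
fwer = 1 - (1 - alpha) ** n_tests
print(f"P(at least one false positive) = {fwer:.2f}")  # prints ~0.26
```

In the analysis described above, the number of predictor-endpoint relationships evaluated was larger still, which only sharpens the point.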
In the applied context introduced above, the path from data to decision was not pre-specified. The analysis might have suggested that differences in adverse event rates were sufficient to merit individualized dosing based on baseline immune response, or it might have suggested that such dosing was unnecessary; the considerations that might favor one course of action over another were manifold, and no a priori criteria were established to integrate these multiple lines of evidence into a decision. Accordingly, my Discussion began with the following caveat, alerting the reader that the evidentiary value of “significant” findings should be discounted in this exploratory context:
Given the large number of relationships between predictors and endpoints evaluated in this analysis, isolated assessments of statistical significance [such as were identified in the Results] should not generally be interpreted as conveying strong evidence. Conversely, lack of statistical significance for any given association is not, in itself, sufficient to conclude that there is no causal or predictive relationship between the given predictor and the given response. Accordingly, the qualitative consistency of each predictor’s associations across multiple response endpoints, and the scientific plausibility of each putative effect, are considered in the discussion below, in addition to purely statistical criteria.
One might argue that -- in order to avoid such vague and post hoc reasoning -- formal decision criteria of some sort should have been established in advance, since a decision would clearly need to be made with respect to individualized dosing. To that I say, without prejudice: “maybe”. It would certainly be a daunting task to establish a formal decision framework compelling to all stakeholders (including regulatory stakeholders). For present purposes, I think it suffices to say that the cost-effectiveness of such an approach is debatable, and that in the event no such strategy was pursued.
In summary, then: this was an essentially exploratory context. Or, at the very least, it was a “not-totally-confirmatory” context.
Bringing this full circle to the motivating question: what is one to make of “statistical significance” in such an exploratory context? A conscientious statistician might justifiably avoid using the language of “significance” at all in such situations. I hew to a different principle, one that is both more flexible and more demanding, and it is this:
Statistical Significance is a Result, Not a Conclusion.
There are many reasons why a statistically significant result may not warrant an associated conclusion. The exploratory nature of the context of inquiry is one such reason, but it is only one. Many factors should govern the translation of results into evidence, and all of them need to be weighed in a Discussion. Therefore, I propose that, rather than forswearing significance language altogether in reference to exploratory results, we instead evaluate its use according to the following rubric:
Results are not meant to be Conclusions.(*) Results, isolated from context, should not even be construed as evidence. Results may or may not take on the character of evidence in the course of a Discussion, and that evidence may or may not be sufficient to warrant specific Conclusions. As long as a statistical argument is structured in this way, “significance” language is harmless. As I will argue in my next post, such language may even provide some value in reference to exploratory results.
(*) In capitalizing and italicizing the words Results, Discussion, and Conclusions I mean to evoke the formal components of a conventional scientific argument (while at the same time conveying the more colloquial meanings).