In my first post in this series, I proposed a “starter kit” for developing a statistical argument. This “starter kit” included considerations related to both the direction of effects (“[Is] the association directionally consistent with known mechanistic accounts (e.g. based on enzymology)?”) and the magnitude of effects (“Does the association have a magnitude that is consequential in some practical sense?”). In this post, I argue that “significance language” can play a valuable role in discussing the estimated direction of effects, even if supplementary approaches are needed to address more complex questions about magnitude.
In an oft-cited article published in 1991 in Statistical Science, John Tukey wrote:
All we know about the world teaches us that the effects of A and B are always different—in some decimal place—for any A and B. Thus asking “Are the effects different?” is foolish.
What we should be answering first is “Can we tell the direction in which the effects of A differ from the effects of B?” In other words, can we be confident about the direction from A to B? Is it “up,” “down” or “uncertain”?
The third answer to this first question is that we are “uncertain about the direction”—it is not, and never should be, that we “accept the null hypothesis.”
On my reading, Tukey’s point here is not primarily that statisticians have used the wrong inferential techniques. Later in the article, he advocates forcefully and convincingly for the confidence interval as an especially meritorious technique, whose inferential value p-values cannot substitute for. However, in these initial remarks, his essential point is not that statisticians have traditionally done the wrong thing, but rather that they have talked about what they have done in the wrong way. The importance of language comes into sharper focus several paragraphs later, where he writes:
What of the analyst, who may even be a statistician, who says “This is all about words -- I may use the bad words, but I do always think the proper thoughts, and always act in the proper way!”
We must reject such a claim as quite inadequate.
Unless we learn to keep what we say, what we think, and what we do all matching one another … we will not serve ourselves, our friends, and our clients adequately.
As I argued in my initial post, the worst “crimes of statistical significance” arise not from the quality of our statistical methods, but from the quality of our statistical arguments. If, in discussing a large p-value, we say that we have “accepted the null hypothesis”, then we are likely to mislead. For this reason, many students of statistics are subjected to a bewildering catechetical exercise in which they are taught to “fail to reject” null hypotheses, as if that choice of phrasing could have any hope of clarifying matters for the statistically uninitiated. This trauma can be entirely avoided if we simply say that a small [large] two-sided p-value conveys that the “direction of the effect can [cannot] be inferred with confidence”. This is generally true because a single two-sided test of H0: m1 == m2 versus HA: m1 != m2 is equivalent to two one-sided tests carried out with a correction for multiplicity (alpha → alpha / 2): one of H01: m1 <= m2 versus HA1: m1 > m2, and the other of H02: m1 >= m2 versus HA2: m1 < m2 (*). In using this directional language, one at least does no harm and at best adds some value, though one could in all likelihood add more value, e.g. with the aid of a confidence interval (**).
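The equivalence invoked above is easy to verify numerically. A minimal sketch with simulated data (the group means, sample sizes, and seed are all arbitrary assumptions for illustration; scipy is assumed available):

```python
# Numerical check: a two-sided test at level alpha rejects exactly when
# one of the two one-sided tests rejects at level alpha / 2, because
# p_two_sided = 2 * min(p_greater, p_less) for a symmetric test statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(loc=0.5, scale=1.0, size=40)  # hypothetical sample from group A
b = rng.normal(loc=0.0, scale=1.0, size=40)  # hypothetical sample from group B

p_two_sided = stats.ttest_ind(a, b, alternative="two-sided").pvalue
p_greater = stats.ttest_ind(a, b, alternative="greater").pvalue  # H01: mA <= mB
p_less = stats.ttest_ind(a, b, alternative="less").pvalue        # H02: mA >= mB

# The two-sided p-value is twice the smaller of the one-sided p-values.
assert np.isclose(p_two_sided, 2 * min(p_greater, p_less))

alpha = 0.05
# Rejecting the two-sided test at alpha is the same event as rejecting
# exactly one of the one-sided tests at alpha / 2, which is what licenses
# a directional claim ("up" or "down") rather than mere "difference".
assert (p_two_sided < alpha) == (p_greater < alpha / 2 or p_less < alpha / 2)
```

When neither one-sided test rejects at alpha / 2, the honest report is Tukey’s third answer: the direction is uncertain.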
With all that in mind, let me return to the example that I began to develop in my previous post. In that application (for reasons explained in that post), my report described six different logistic regression models (one per adverse event endpoint), each of which included at least 10 regression coefficients (roughly sixty parameters in total). I went about it in this way:
- In the prose section of my Results, I (selectively) called attention to specific parameter estimates that I would later require to support my Discussion. In these prose references, I merely cited the parameter estimate and indicated whether it was significant at the 0.05 level ( ⇐ The Scene of the Crime!!).
- Although I did not (for reasons that I elaborate below) offer confidence intervals “in line” in my Results, a 95% confidence interval was provided for every parameter in tables in the Appendix.
- In my Discussion section, I interpreted and contextualized what I had cited in Results. This included the caveat that I shared in my previous post (“Given the large number of relationships between predictors and endpoints evaluated in this analysis, isolated assessments of statistical significance should not generally be interpreted as conveying strong evidence.…”). It also included statements to the effect that “non-significant associations” were merely associations for which one could not, at present, determine the direction of the effect (“absence of evidence is not evidence of absence”). For select parameters of interest, I provided contextualization of the magnitude with the aid of the confidence intervals that had been provided in the Appendix.
The net effect of that strategy was that I discussed the vast majority of parameters in terms of directional inference only. I did not grapple meaningfully with questions of magnitude for most parameters because, quite simply, it would have been a lot of work. This is not merely a matter of laziness, but also of sympathy for the reader. For any nonlinear model (including any generalized linear model), expressing magnitudes in a meaningful way requires discussion of predicted values of the endpoint of interest, and not merely parameter estimates. Moreover, for any parameter (even in a linear model), it is no easy matter to establish what a practically consequential magnitude might be. When those battles need to be fought, then fight them. However, in at least some cases, especially when the direction of an effect cannot be confidently established, I find it wise to save the reader’s energy for numbers that he or she will find actionable.
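The point about nonlinear models deserves a concrete illustration. In a logistic regression, a coefficient is a shift in log-odds, so the same coefficient implies very different changes in predicted probability depending on where a covariate pattern sits on the curve. A toy sketch (the coefficient and baselines are invented numbers, not estimates from my report):

```python
# Why a logistic-regression coefficient alone does not convey magnitude:
# a fixed log-odds shift (a constant odds ratio) produces different
# risk differences at different baseline risks.
import math

def predicted_prob(log_odds: float) -> float:
    """Inverse logit: convert log-odds to a predicted probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

beta = 0.7  # assumed coefficient for a binary predictor (log odds ratio)

for baseline_log_odds in (-4.0, -1.0, 0.0):
    p0 = predicted_prob(baseline_log_odds)         # risk without the predictor
    p1 = predicted_prob(baseline_log_odds + beta)  # risk with the predictor
    print(f"baseline risk {p0:.3f} -> risk {p1:.3f}, "
          f"risk difference {p1 - p0:.3f}")
```

The odds ratio exp(0.7) is the same in every row, but the risk difference is tiny at a rare-event baseline and sizable near 50% risk, which is why discussing magnitude forces one into predicted values for particular covariate patterns.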
This brings me to my final point:
Insofar as one merely wishes to suggest whether the direction of an effect can be inferred with confidence, a statement in the Discussion section regarding statistical significance at some conventional level of confidence (0.05, let’s say) is perfectly acceptable. By simply referring the result to a conventional reference point, “significance language” has the merit of not distracting the reader with an excess of p-values and confidence intervals. Of course that cannot be the end of the story. Of course the evidentiary value needs to be discounted if the result arose in an exploratory context. But, if one’s goal is to first render high-level assessments in intelligible and uncluttered prose (to be followed, where necessary, by substantial elaboration), “significance language” really isn’t so bad.
In summary, while it is true that we should not exalt determinations of statistical significance, while it is true that we should relegate statistical significance to the realm of Results and not Conclusions, nonetheless it would be silly to deprive ourselves of the language of statistical significance altogether. It is a language that can help us stumble through the half-lit world of directional inference until we see more clearly the objects whose magnitudes we want to describe.
(*) At least, this is true of most unidimensional hypothesis tests, e.g. tests of whether a single contrast m1 - m2 is equal to zero; a similar approach can be taken in higher dimensions through application of the partitioning principle.
(**) Gelman and Carlin make use of a similar distinction in the Bayesian framework, that of “Type-S” and “Type-M” errors.