*This is the first in a series of blog posts that will focus on practical problems of summarizing statistical evidence in the context of biomedical science, using the 2016 ASA statement on p-values and significance as a guide. I hope and expect that these posts will find a readership that does not entirely agree with my views in all of their particulars, and that this readership will take advantage of the blog / combox format to register both its agreements and disagreements. Merely good ideas are insufficient to the task at hand; a shared understanding has to develop in the community of practice. My hope here is to provide a forum for discussion for that community.*

In 2016, on behalf of the board of the American Statistical Association (ASA), Ronald Wasserstein issued a Statement on Statistical Significance and *P*-values in *The American Statistician*. The statement proposed six principles for the proper application and interpretation of *p*-values and statistical significance. The context of this statement was described in an accompanying article in which Wasserstein and co-author Nicole Lazar explained:

Let us be clear. Nothing in the ASA statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail. We hoped that a statement from the world's largest professional association of statisticians would open a fresh discussion and draw renewed and vigorous attention to changing the practice of science with regards to the use of statistical inference.

One might say that, whereas isolated voices have long endeavored to pluck certain weeds from the soil of statistical practice, this coordinated effort was intended to provide some industrial-strength tilling of that soil. This tilling, of course, cannot be the end of the story. It has now been appropriately left to the laborers in the various fields to see that something worthwhile is planted, grown to fruition, and harvested. While this process has only just begun, there are promising signs of progress in a variety of biomedical research fields (see Xi'An's Og for a roundup of examples). In pharmacometrics as well, this is an opportunity to grow something new and better, weeding out the worst of our old practices.

A plenitude of alternative statistical methods are on offer that purport to address this challenge. For example, an (anecdotally) already-popular -- but misleading -- attempt to paraphrase the ASA statement consists in the following: “Don’t use *p*-values anymore; use confidence intervals.” Whatever may be said about the merits of confidence intervals over *p*-values (and for my part, I would readily agree that the relative merits of confidence intervals are considerable), this and similar method-focused oversimplifications of the ASA statement represent a facile reduction of what is, in fact, a much richer and grander vision. Indeed, the ASA statement makes clear in its introductory paragraph that the problem cannot be reduced to a narrow one of statistical methodology: “The validity of scientific conclusions, including their reproducibility, *depends on more than the statistical methods themselves*” (emphasis mine).
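The near-duality behind that oversimplification can be made concrete. The following is a minimal, purely illustrative sketch (my own construction, not from the ASA statement), assuming a normally distributed estimate with known standard error and the conventional 0.05 threshold:

```python
import math

def z_test_and_ci(estimate, se, z_crit=1.959964):
    """Two-sided z-test of H0: effect = 0, plus the matching 95% CI.

    Illustrative sketch only: assumes the estimate is (approximately)
    normally distributed with known standard error `se`; 1.959964 is
    the 97.5th percentile of the standard normal.
    """
    z = estimate / se
    # Two-sided p-value from the standard normal CDF, via erf.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    ci = (estimate - z_crit * se, estimate + z_crit * se)
    return p, ci
```

Under these assumptions, p < 0.05 exactly when the 95% interval excludes zero. In other words, swapping the p-value for the interval preserves the very thresholding behavior the ASA statement warns against; the interval's added value (a sense of magnitude and precision) only helps if the surrounding argument actually uses it.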

Of course, this is not to say that statistical methodology is unimportant. In a section of the ASA statement entitled *Other Approaches*, attention is called to “methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors; and other approaches such as decision-theoretic modeling and false discovery rates.” Nonetheless, the paper as a whole makes it clear that any new quiver of methods, however well selected, is incommensurate with the challenge at hand. To see why this is the case, consider *Principle 4*:

Scientific conclusions and business or policy decisions should not be based only on whether a *p*-value passes a specific threshold.

In elaborating this principle, Wasserstein writes:

Researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis.

Wholeheartedly: “YES!” Above all, this is a call for richer *contextualization* of statistical results. Results can *only* take on the character of evidence when they are interpreted within a specific context of inquiry.

While some statistical methods may *facilitate* the contextualization of results, contextualization is ultimately a matter not of statistical *methods*, but of statistical *argument*. By *statistical argument*, I mean simply: reasoned scientific argument that employs (among other things) statistical evidence. Formal scientific argument generally proceeds along a well-worn and widely recognized path: *Objectives* ⇒ *Methods* ⇒ *Results* ⇒ *Discussion* ⇒ *Conclusions*. Statistical reasoning has a role to play at each station along this path, and cannot consist merely in applying methods to data to generate results. In particular, the *Discussion* section in a scientific argument eagerly awaits the richer contextualization of statistical results that the ASA statement advocates.

How then should a *statistical argument* proceed? What set of considerations permit results to take on the character of evidence? What essential features define a context of inquiry? Without attempting to be exhaustive, I would propose the following considerations as a “starter kit” appropriate to any *Discussion* in a statistical argument:

- Was the context of inquiry confirmatory, exploratory, or something in between? To what extent was analysis of the association pre-specified, and (by contrast) to what extent could it be distorted by selection bias (e.g. due to model selection, endpoint selection, selection of exposure metrics, covariate selection, etc.)?
- To what extent does the context support a causal interpretation of the observed association? For example, does the association of interest correspond to a randomized comparison? And, whether randomized or not: is the association directionally consistent with known mechanistic accounts (e.g., based on enzymology)?
- What is the practical context? Does the association have a magnitude that is consequential in some practical sense?
- Retrospectively, how robust are the results? Do they change substantively as a function of the analysis data set, or as a function of model assumptions?
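The last of these considerations lends itself to a simple computational sketch. As a toy, hypothetical example (the functions and data below are my own illustration, not part of the "starter kit" itself), one elementary robustness check is leave-one-out refitting: delete each observation in turn and ask how far the estimate moves.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def leave_one_out_slopes(xs, ys):
    """Refit the slope with each observation deleted in turn."""
    return [slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
            for i in range(len(xs))]

# Toy data with one influential point (the last observation).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 3.0, 4.0, 10.0]
full = slope(xs, ys)                 # 2.0 on the full data
loo = leave_one_out_slopes(xs, ys)   # ranges from 1.0 to 2.5
```

Here a single observation is responsible for most of the apparent effect: deleting it cuts the slope in half. Whether that spread is "substantive" is, of course, itself a contextual judgment about practical magnitude, which is precisely the point.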

I hope at this point that I have sketched with sufficient resolution what I mean by *statistical argument*, as distinct from *statistical methodology*, and I hope that I have argued convincingly that the ASA statement is a call not only to reevaluate the latter, but also -- and especially -- to enrich the former. In my next post I propose a (limited) role that I think the concept of “statistical significance” can rightly play in the context of a broader statistical argument.