Better Ways to Evaluate Research Results

Widespread reports of irreproducibility in science have fueled the perception that scientists are prone to mistaken claims, some of them products of fabrication, falsification, or plagiarism. Multiple private organizations have emerged that seek to reproduce published work as a means of identifying purported errors, and a major federally funded reproducibility project is being planned. Such efforts, I believe, are ill-advised.

First, it is instructive to consider two distinct classes of reproducibility, which I denote replicability and robustness. Replicability refers to the capacity of “laboratories to reproduce other laboratories’ results,” as Anne L. Plant puts it in “Build Confidence in Science by Embracing Uncertainty Rather than Chasing Reproducibility” (Issues, Fall 2025). She maintains that replicability demands systematic, exact matching of experimental conditions, and she and her colleagues are working to establish a framework that moves laboratories toward replication. Importantly, however, Plant notes that neither a failed nor a successful replication speaks to the validity of the original finding.

The second class of reproducibility, robustness, refers to the persistence of a finding under different experimental conditions. Biological scientists, for example, assess this measure of reproducibility routinely, because a researcher who seeks to build on another’s prior findings will first test whether those findings hold in their own biological system and under their own experimental conditions. Both success and failure in such tests of robustness provide insight, but as with replicability, neither determines validity.

Therefore, because reproducibility (whether replicability or robustness) cannot confirm or falsify a research claim, any program aimed at reproducing published work is misguided, unproductive, and wasteful, and it inflicts stress and reputational damage on researchers whose work is selected for scrutiny.


Instead of reproducibility tests, then, the scientific community should document the causes of error in our work and then implement plans to address them. I suggest five causes of error, ordered by my guesstimate of frequency: (1) flawed experimental design, including failure to account for known variables and other sources of uncertainty or to ensure adequate statistical power; (2) unknown variables, which always lurk in complex experimental systems and protocols; (3) overheated claims of significance and insufficient acknowledgment of caveats; (4) poor experimental execution; and (5) misconduct.

Notably, accounting for potential sources of experimental uncertainty, as Plant considers thoughtfully, is critical for both classes of reproducibility: essential for replicability, and the primary parameter in assessing robustness.

Each of the five potential sources of error can be addressed, partially or fully, by improved education: more mindful, more rigorous, and more grounded in quantitation. While unknown variables cannot be eliminated, training researchers to recognize and acknowledge them would increase the integrity of the published record and should be welcomed by journal editors, not viewed as grounds for rejection. For item 3, improved training must be coupled with new metrics and standards for academic hiring and promotion, for the evaluation and award of research funds, and for the publication of research findings. Items 1 and 4, addressed in part by better training, will increasingly be ameliorated by technology-enabled research ecosystems such as automated and autonomous laboratories. And misconduct can be diminished through enterprise-wide standards and training in ethics and integrity, joined by more attentive review of publication and funding submissions, advanced technologies for detecting malfeasance, and forceful sanctions for violators.

Special Advisor to the Chancellor, Science Policy & Strategy

Vice Chancellor for Research, Emeritus

Professor, Cellular & Molecular Pharmacology, Emeritus

University of California, San Francisco

Anne L. Plant makes a valid point that lack of reproducibility between experimental studies should not be equated with results being wrong. That said, her jump to the conclusion that, in the absence of fraud, “all studies provide useful, although incomplete, information” is a non sequitur.

There is obviously opportunity for error in science, even when researchers have the best of intentions, and journal-based peer review is not an efficient way to detect it. Thus, many published results may be misleading rather than useful. If results are not reproducible, one should indeed evaluate sources of uncertainty in experiments. But irreproducible results may also arise from technical errors or statistical fluctuations, especially when appropriate statistical power is not the norm.

Laboratory scientists seem to have a hard time acknowledging this possibility. In the Brazilian Reproducibility Initiative, we performed 143 independent replications of 56 lab biology experiments, which frequently failed to reproduce the original results. When asked for explanations, researchers almost invariably defaulted to differences between protocols, usually attributed to poor reporting in the original studies.

Strikingly, this happened even in cases where multiple replications with different protocols found consistent effects in the opposite direction. The possibility that the original result may have been a fluke was hard to elicit, even when it seemed the most natural explanation, perhaps because it is seen as implying bad faith or incompetence.

There are many ways to get things wrong in the lab, as we witnessed in our own project. Protocols are broken, numbers are mistyped, group labels are reversed, primer sequences are misspelled, and biases are frequent. Such errors are rarely detectable, especially when primary data are not shared; we became aware of them in our study only because we audited our replications much more closely than regular peer review does.

Thus, although efforts to establish minimum information guidelines for protocols are laudable, they will be useful only if the reported protocols are followed strictly, which in our experience is often not the case.


All in all, Plant’s advice that scientists should embrace uncertainty is sound, but uncertainty does not come only from experimental variables. Although Figure 1 in her article lists over 50 potential forms of variability for lab experiments, one important factor seems to be missing: the fallibility of the humans who are dealing with them.

As increasingly complex methods are deployed in an academic environment with little tradition of systematic quality control, errors become more likely, especially when career incentives reward publishing fast. If we cannot accept that some papers may simply be wrong, trying to make sense of the literature or to build theory from it becomes a maddening exercise.

Importantly, this does not have to imply losing trust in the integrity of researchers, or in the scientific endeavor as a whole. But for confidence in science to be preserved, it is important to remind ourselves that science is fallible, not only to keep our expectations realistic but also to reflect on how we can do better.

Institute of Medical Biochemistry Leopoldo de Meis

Federal University of Rio de Janeiro

One thing Anne Plant’s article makes clear is the great complexity of harmonizing the nuances of similar experiments across laboratories to facilitate replication attempts or data comparisons. It is certainly important to remember that researchers could have made other decisions in the lab, and that seemingly minor differences between experiments can impact results. But the differences between any lab and the real world are even greater, as is the number of decisions researchers make in order to render the complexity of the real world tractable for study.

Careful enumerations of potential sources of variability, like those Plant lists in Figure 1, could improve the reliability of scientific papers if adopted more widely. Readers also need to understand how experiments differ from the real-world situations they aim to illuminate and the many sources of uncertainty in this comparison. Carefully enumerating details in this area as well could help increase the usability of experimental research.


A trade-off between rigor and relevance has been noted in some fields that apply scientific data to real-world situations: the more rigorous academics make their studies, the less useful they can seem to people who need to apply the results. The trade-off happens because the real world is chaotic, and one way to make an experiment more replicable is to simplify the definitions, conditions, and experimental subjects involved. Some examples include agricultural experiments carried out in uniform soil and under controlled conditions quite different from the variable conditions found on farms; metabolic health studies conducted in mice with uniform genetics and standardized pellet-based diets; and psychology experiments that operationally define and measure a concept like “honesty” or “open-mindedness” in a context that is far more limited than the real-life ways people demonstrate these faculties. In clinical trial design, available options span from traditional “explanatory” trials, which give a drug or intervention the best chance to show benefit (and to replicate) by testing it in a narrow subset of well-supervised patients, all the way to the most “pragmatic” trials, which test the intervention in the broader and more variable spectrum of real-world patients and conditions under which the intervention may be used.

Because there are often such trade-offs, a formulaic emphasis on rigor could end up making research less relevant to the real world. The harmonized toxicology experiments Plant references were performed using A549, a standard cell line derived from a single Caucasian patient in 1972; low cell variability is desirable for replication success and tractability, but not necessarily for broad applicability of toxicology results. Plant argues that a more “systematic and conceptual approach” with “adequate deliberation” about uncertainty should be adopted rather than simply rigor or reproducibility checklists. I agree, and this should be extended to uncertainty in using study results to understand the real world. To deliver the most value from research, we need to ask what would increase end users’ confidence in experimental data, both in its internal validity and in its usability.

Striga Scientific

Rochester, New York

Since it was first flagged, the reproducibility crisis has prompted speculation about the reasons behind it, typically with negative connotations. For many observers, “irreproducible” results are flawed results, and irreproducibility is a symptom of findings that are plainly wrong to accept. Anne L. Plant argues that instead of focusing on reproducibility (without really understanding it) as a way of evaluating research quality, one should assess the sources of uncertainty.

It is surprising how little attention is paid to the simple fact that reproducibility and uncertainty go hand in hand, and that it is the latter that defines the extent of the former. Biology, with the badge of complexity it proudly wears, is a prime example. And yet little consideration is given to sources of uncertainty that may or may not be unique to the biological systems in question. Researchers like to talk about context-dependent and stochastic phenomena, and they enrich their papers with statistics, but they often fail to acknowledge the sources of variability. Well-meaning measures have been introduced. Checklists specifying expected variables, repeats, and controls, together with their number and nature, and guidelines recommending adherence to the FAIR principles for scientific data management and stewardship (Findability, Accessibility, Interoperability, and Reusability) might indeed account for common sources of uncertainty. This would apply, for example, to differences in cell lines (if they are the same line from the same supplier), to single-cell measurements versus cell populations (if from the same generation and of the same phenotype), and so on, from molecule to organism. But none of these measures is adequate even for relatively well understood scenarios. Different isoforms of the same protein can be produced from the same gene depending on the source of production or origin, a complexity that has prompted the development of proteomics approaches that themselves contribute additional sources of measurement uncertainty.


Here we ought to acknowledge an elephant in the room: there are barriers to what Plant calls for. The relentless push for success, with its emphasis on positive, novel, and “exciting” findings, is one such obstacle on the path to progress. Negative results do not excite editors or funders, who may be keen on reproducibility but not necessarily on the reasons it fails. And yet the irony is that these perceived failures may well prove far more valuable to others who choose to follow the same experimental strategy but have no means of early warning.

This reemphasizes the need for cause-and-effect analysis in evaluating sources of uncertainty, both experimental and computational, especially those that are largest and that most limit reproducibility.

Thus, reproducibility is a valuable proxy for revealing sources of uncertainty so that they can be addressed in a systematic, fit-for-purpose manner. By tackling those sources rather than chasing reproducibility for its own sake, we can build a reliable body of knowledge that is applicable to future experiments, relieve others of the burden of salvaging irreproducible results, and remove a major barrier to scientific progress, reducing wasted effort and accelerating scientific advancement.

National Physical Laboratory

Teddington, Middlesex, United Kingdom
