Margins of Error

Script / Documentation


Ever lost a bet? From the lottery or sporting events to casinos or friendly wagers, you may have risked and lost some money because you hoped to win big.

But let me ask you this: How big would the payout have to be and how good would the odds need to be to gamble with your life or the lives of your loved ones?

In this lesson from Just Facts Academy about Margins of Error, we’ll show you how people do that without even realizing it. And more importantly, we’ll give you the tools you need to avoid falling into this trap.

Ready? C’mon, what have you got to lose?

People often use data from studies, tests, and surveys to make life-or-death decisions, such as what medicines they should take, what kinds of foods they should eat, and what activities they should embrace or avoid.

The problem is that such data isn’t always as concrete as the media and certain scholars make it out to be.

Look at it this way. There are four layers to this “margin of error” cake. Let’s start with the simplest one, like this headline from the Los Angeles Times, which declares, “California sea levels to rise 5-plus feet this century, study says.”[1]

That sounds pretty scary, but the study has margins of error, and it actually predicts a sea-level rise of 17 to 66 inches.[2] In the body of the article, the reporter walks back the headline a little, but he fails to provide even a hint that the “5-plus feet” is the upper bound of an estimate whose lower bound is roughly a quarter of that.[3]

Studies often have margins of error, or bounds of uncertainty, so the moment you hear someone summarize a study with a single figure, dig deeper. This is the same principle taught in Just Facts Academy’s lesson on Primary Sources: Don’t rely on secondary sources because they often reflect someone’s interpretation of the facts—instead of the actual facts.

Also, don’t assume that the authors of the primary sources will report the vital margins of error near the top of their studies. In the famed Bangladesh face mask study, for example, the authors lay down 4,000 words before they disclose a range of uncertainty that undercuts their primary finding.[4] [5] [6]

Here are a few more tips to help you critically examine margins of error.

Surveys often present their results like this:

11.5% ± 0.3

It’s quite simple. The first number is the nominal or best estimate, technically called the “point estimate.” The second number is the margin of error.

In the case of this survey,[7] it means that the best estimate of the U.S. poverty rate is 11.5%, but the actual figure may be as low as 11.2% or as high as 11.8%.
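If you like to see the arithmetic spelled out, here’s a minimal Python sketch using the figures quoted above (the variable names are just for illustration):

```python
# Turn a point estimate and a margin of error into a low/high range.
# These are the figures quoted above from the Census Bureau poverty report.
point_estimate = 11.5   # best estimate of the U.S. poverty rate, in percent
margin_of_error = 0.3   # reported margin of error, in percentage points

low = point_estimate - margin_of_error
high = point_estimate + margin_of_error

print(f"Poverty rate: {point_estimate}% ({low:.1f}% to {high:.1f}%)")
# Prints: Poverty rate: 11.5% (11.2% to 11.8%)
```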

Scholarly publications often use a less intuitive convention and present their results like this:

4.70; 95% CI, 1.77–12.52

Now, don’t let this barrage of digits intimidate you. They’re actually easy to understand once you crack the code.

The first number is the best estimate. In the case of this study,[8] it means that bisexual men are roughly 4.7 times “more likely to report severe psychological distress” than heterosexual men.

The last two numbers are the outer bounds of the study’s results after the margins of error are included. They mean that bisexual men are about 1.8 to 12.5 times more likely to report distress than heterosexual men.[9]

That’s a really broad range, especially when compared to the single figure of 4.7. Do you see why margins of error are so essential?
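Here’s a small, purely illustrative Python sketch of that decoding, using the numbers quoted above (the describe_result helper is made up for this lesson, not taken from the study):

```python
def describe_result(label, point_estimate, ci_low, ci_high):
    """Print a study result with its full range of uncertainty,
    not just the single headline number."""
    print(f"{label}:")
    print(f"  point estimate: {point_estimate} times more likely")
    print(f"  95% CI:         {ci_low} to {ci_high} times more likely")
    print(f"  the upper bound is {ci_high / ci_low:.1f} times the lower bound")

# Figures quoted above: severe psychological distress among bisexual men
# relative to heterosexual men (odds ratio with its 95% confidence interval).
describe_result("Severe psychological distress", 4.70, 1.77, 12.52)
```

Printing the full range next to the headline number makes the breadth hard to miss: the upper bound is about seven times the lower bound.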

Now, here’s something a lot of people don’t know because journalists rarely explain it or don’t understand it: Reported margins of error and ranges of uncertainty typically account for just one type of error, known as sampling error.[10] [11] [12] [13] [14] [15] This error arises from relying on a sample rather than the entire population, and its size depends largely on how big that sample is. Generally speaking, the larger the sample, the smaller the margin of sampling error.[16]
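To see how sample size drives the margin of sampling error, here’s a rough sketch that uses the standard textbook formula for a survey proportion. It’s a simplification that assumes a simple random sample and 95% confidence; real surveys use more elaborate methods, but the pattern is the same:

```python
import math

def sampling_moe(p, n, z=1.96):
    """Approximate margin of sampling error (95% confidence) for a
    proportion p estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# A result near 50% gives the widest margin, so it's the usual worst case.
# Notice how slowly the margin shrinks as the sample gets bigger.
for n in [100, 1_000, 4_000, 10_000]:
    print(f"sample of {n:>6,}: roughly ±{sampling_moe(0.5, n) * 100:.1f} percentage points")
```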

It’s super important to be aware of this, because there are often other layers of uncertainty that aren’t reflected in sampling errors.[17] [18] [19] [20] Figures like 1.77 to 12.52 sound very specific and solid, but that can be an illusion. If you don’t understand this, you can be easily misled to believe that the results of a study are ironclad when they are not.

This brings us to the “95% CI.” What does that mean?

It stands for “95% confidence interval,”[21] and contrary to what your statistics teacher may have told you,[22] it generally means that there’s a 95% chance the upper and lower bounds of the study contain the real figure.[23] That means there’s a 5% chance they don’t.

How’s that for gambling? Would you step outside your home today if you knew there was a 1 in 100 chance you wouldn’t make it back alive? Well, even the outer bounds of most study results are less certain than that.

You see, time, money, and circumstances often limit the sizes of studies, tests, and surveys.[24] So even if their methodologies are sound, reality may lie outside the bounds of the results due to mere chance.[25]
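You can make that concrete with a toy simulation. The sketch below assumes a simple yes/no survey with a made-up “true” rate, draws thousands of samples, builds the usual 95% confidence interval from each, and counts how often the interval misses the truth:

```python
import math
import random

random.seed(1)

TRUE_RATE = 0.115    # pretend we secretly know the real population rate: 11.5%
SAMPLE_SIZE = 1_000
SURVEYS = 5_000

misses = 0
for _ in range(SURVEYS):
    # Run one simulated survey and compute its point estimate.
    hits = sum(random.random() < TRUE_RATE for _ in range(SAMPLE_SIZE))
    p_hat = hits / SAMPLE_SIZE
    # Build the usual 95% confidence interval from sampling error alone.
    moe = 1.96 * math.sqrt(p_hat * (1 - p_hat) / SAMPLE_SIZE)
    if not (p_hat - moe <= TRUE_RATE <= p_hat + moe):
        misses += 1

print(f"Surveys whose interval missed the true rate: {misses / SURVEYS:.1%}")
# Expect something close to 5% -- sound methodology still misses by chance.
```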

On top of this, some studies measure multiple types of outcomes while failing to account for the fact that each attempt to measure a separate outcome increases the likelihood of getting a seemingly solid result due to pure chance.[26] Look at it this way: if you roll a pair of dice 12 times, you’re roughly ten times more likely to roll a 2 at least once than if you roll them just once.

Even worse, there are scholars who roll those dice behind the scenes by calculating different outcomes until they find one that provides a result they want. And that’s the only one they’ll tell you about.[27] [28]
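Here’s a toy simulation of that problem (not a re-analysis of any real study). Every “outcome” below is pure noise with no real effect, yet a study that quietly tests 20 of them usually stumbles onto at least one “significant” result:

```python
import math
import random
import statistics

random.seed(1)

def looks_significant(n_per_group=50):
    """Compare two groups drawn from the SAME distribution (no real effect)
    and report whether the difference still looks 'significant'
    (|t| > 1.96, roughly a p-value below 0.05)."""
    a = [random.gauss(0, 1) for _ in range(n_per_group)]
    b = [random.gauss(0, 1) for _ in range(n_per_group)]
    se = math.sqrt(statistics.variance(a) / n_per_group +
                   statistics.variance(b) / n_per_group)
    t = (statistics.mean(a) - statistics.mean(b)) / se
    return abs(t) > 1.96

STUDIES = 1_000
OUTCOMES_PER_STUDY = 20

fluke_studies = sum(
    any(looks_significant() for _ in range(OUTCOMES_PER_STUDY))
    for _ in range(STUDIES)
)
# Each single test has about a 5% chance of a false positive, but a study
# testing 20 noise-only outcomes gets at least one fluke most of the time.
print(f"Studies with at least one fluke 'significant' outcome: "
      f"{fluke_studies / STUDIES:.0%}")
```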

Now, let’s take a step back and look at the layers of the cake:

  • First, you have the point estimate.
  • Then, you have the outer bounds, which commonly account for the margin of sampling error but not for other sources of uncertainty.
  • Then, you have the confidence interval percentage, or the probability that the outer bounds contain the real figure.

We’ll get to the base layer in a moment, but now is a good time to talk about a concept called “statistical significance,” because we’ve cut through enough cake to understand it.

Study results are typically labeled “statistically significant” if they are entirely positive or entirely negative throughout the full margin of sampling error with 95% confidence.[29] [30] [31] [32]

For example, if a medical study finds a treatment is 10% to 30% effective with 95% confidence, this is considered to be a statistically significant outcome. That’s a shorthand way of saying the result probably isn’t due to sampling error.[33]

And if a study finds that a treatment is –10% to 30% effective with 95% confidence, such a result is considered “statistically insignificant” because it crosses the line of zero effect.[34] [35] [36] [37] This could mean that the treatment has a positive effect, or no effect, or a negative effect.[38] [39]

One way to sort this out is to look at the size of the study sample. If it’s relatively large, and the results are statistically insignificant, that’s a pretty good indication the effect is trivial.[40] [41] [42] [43]
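Here’s an illustrative sketch of both ideas, with made-up effectiveness figures: a result is labeled “statistically significant” when its interval stays entirely on one side of zero, and an insignificant interval from a large study is useful mainly because it’s narrow:

```python
def classify(ci_low, ci_high):
    """Label a 95% confidence interval the way studies typically do."""
    if ci_low > 0 or ci_high < 0:
        return "statistically significant"
    return "statistically insignificant (crosses zero)"

# Hypothetical effectiveness results, in percent, for three imaginary studies.
results = [
    ("Small study", -10, 30),   # wide and crosses zero: could be harm, nothing, or help
    ("Large study",  -2,  3),   # crosses zero but narrow: any effect is small
    ("Third study",  10, 30),   # entirely positive: statistically significant
]

for name, low, high in results:
    print(f"{name}: {low}% to {high}% -> {classify(low, high)}")
```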

Hundreds of scholars have called for ending the convention of labeling results as “statistically significant” or “insignificant” because it can lead people to jump to false conclusions.[44] Nonetheless, it’s a common practice,[45] [46] [47] [48] so here are some tips to avoid such risky leaps:

  • One, don’t mistake statistical significance for real-world importance. A study’s results can be statistically significant but also tiny or irrelevant.[49] [50]
  • Two, don’t assume that a statistically insignificant result means there’s no difference or no effect.[51] Sometimes studies are underpowered, which means their samples are too small to reliably detect real effects.[52] In other words, there’s a major distinction between saying that a study “found no statistically significant effect” and saying “there’s no effect.”[53]
  • Three, and most importantly, don’t fall into the trap of believing that a study is reliable just because its results are statistically significant.[54] That’s the final layer of the cake, and it’s where the riskiest gambling occurs.

Here’s what I mean.

The study on sea level rise we discussed—well, it’s based on a computer model,[55] a type of study that is notoriously unreliable.[56] [57] [58] [59] [60] [61]

And the study about psychological distress and sexuality—it’s an observational study,[62] a type of research that can rarely determine cause and effect, even though scholars falsely imply or explicitly claim that it can.[63] [64] [65] [66] [67]

Then there are all kinds of survey-related errors exposed in Just Facts Academy’s lesson on Deconstructing Polls & Surveys.

Bottom line—the “margins of error” reported by journalists and scholars rarely account for the many other sources of error.

Gone are the days when you could blindly trust a study just because it is publicized by your favorite news source, appears in a peer-reviewed journal, was written by a PhD, or is endorsed by a government agency or professional association.

Incompetence and dishonesty are simply far too rampant to outsource major life decisions without critical analysis.

So don’t gamble your life on “experts” who offer solid bets that “you can’t lose.” Instead, keep it locked to Just Facts Academy, so you can learn how to research like a genius.


Footnotes

[1] Article: “California Sea Levels to Rise 5-Plus Feet This Century, Study Says.” By Tony Barboza. Los Angeles Times, June 24, 2012. <articles.latimes.com>

Sea levels along the California coast are expected to rise up to 1 foot in 20 years, 2 feet by 2050 and as much as 5 1/2 feet by the end of the century, climbing slightly more than the global average and increasing the risk of flooding and storm damage, a new study says. …

Coastal California could see serious damage from storms within a few decades, especially in low-lying areas of Southern California and the Bay Area. San Francisco International Airport, for instance, could flood if the sea rises a little more than a foot, a mark expected to be reached in the next few decades. Erosion could cause coastal cliffs to retreat more than 100 feet by 2100, according to the report.

[2] Paper: “Sea-Level Rise for the Coasts of California, Oregon, and Washington: Past, Present, and Future.” By the Committee on Sea Level Rise in California, Oregon, and Washington, National Research Council. National Academies Press, 2012. <www.nap.edu>

Pages 4–6:

For the California coast south of Cape Mendocino, the committee projects that sea level will rise 4–30 cm [2–12 inches] by 2030 relative to 2000, 12–61 cm [5–24 inches] by 2050, and 42–167 cm [17–66 inches] by 2100.

[3] Article: “California Sea Levels to Rise 5-Plus Feet This Century, Study Says.” By Tony Barboza. Los Angeles Times, June 24, 2012. <articles.latimes.com>

Sea levels along the California coast are expected to rise up to 1 foot in 20 years, 2 feet by 2050 and as much as 5 1/2 feet by the end of the century, climbing slightly more than the global average and increasing the risk of flooding and storm damage, a new study says. …

Coastal California could see serious damage from storms within a few decades, especially in low-lying areas of Southern California and the Bay Area. San Francisco International Airport, for instance, could flood if the sea rises a little more than a foot, a mark expected to be reached in the next few decades. Erosion could cause coastal cliffs to retreat more than 100 feet by 2100, according to the report.

[4] Article: “Famed Bangladesh Mask Study Excluded Crucial Data.” By James D. Agresti. Just Facts, April 8, 2022. <www.justfactsdaily.com>

Beyond excluding the death data, the authors engaged in other actions that reflect poorly on their integrity. One of the worst is touting their findings with far more certainty than warranted by the actual evidence. For example, some of the authors wrote a New York Times op-ed declaring that “masks work,” a claim undercut by the following facts from their own study: …

• Their study’s “primary outcome,” a positive blood test for Covid-19 antibodies, found that less than 1% of the participants caught C-19, including 0.68% in villages where people were pressured to wear masks, and 0.76% in villages that were not. This is a total difference of 0.08 percentage points in a study of more than 300,000 people.

• Their paper lays down 4,000 words before it reveals the sampling margins of error in the results above, which show with 95% confidence that … cloth masks reduced the risk of catching symptomatic C-19 by as much as 23% or increased the risk by as much as 8%.

• “Not statistically significant” is the common term used to describe study results that aren’t totally positive or totally negative throughout the full margin of error, like the results above. Yet, the authors skip this fact in their op-ed and bury it in their paper, writing at the end of an unrelated paragraph that it showed “no statistically significant effect for cloth masks.”

NOTE: The next two footnotes document the primary sources.

[5] Paper: “Impact of Community Masking on COVID-19: A Cluster-Randomized Trial in Bangladesh.” By Jason Abaluck and others. Science, December 2, 2021. <www.science.org>

Page 3:

We find clear evidence that surgical masks lead to a relative reduction in symptomatic seroprevalence of 11.1% (adjusted prevalence ratio = 0.89 [0.78, 1.00]; control prevalence = 0.81%; treatment prevalence = 0.72%). Although the point estimates for cloth masks suggests that they reduce risk, the confidence limits include both an effect size similar to surgical masks and no effect at all (adjusted prevalence ratio = 0.94 [0.78, 1.10]; control = 0.67%; treatment = 0.61%).

NOTE: The quote above is buried 4,000 words into the paper. Moreover, the authors misleadingly describe these results. The outer bound of “1.00” for surgical masks actually means no effect at all, but the authors fail to use this term when describing that outcome. Instead, they use the term “no effect at all” to describe the outer bound of “1.10” for cloth masks when this actually means a 10% increase in the risk of catching Covid-19.

Page 4:

We find clear evidence that the intervention reduced symptoms: We estimate a reduction of 11.6% (adjusted prevalence ratio = 0.88 [0.83, 0.93]; control = 8.60%; treatment = 7.63%). Additionally, when we look separately by cloth and surgical masks, we find that the intervention led to a reduction in COVID-19–like symptoms under either mask type (p = 0.000 for surgical; p = 0.066 for cloth), but the effect size in surgical mask villages was 30 to 80% larger depending on the specification. In table S9, we run the same specifications using the smaller sample used in our symptomatic seroprevalence regression (i.e., those who consented to give blood). In this sample, we continue to find an effect overall and an effect for surgical masks but see no statistically significant effect for cloth masks.

[6] Commentary: “We Did the Research: Masks Work, and You Should Choose a High Quality Mask if Possible.” By Jason Abaluck, Laura H. Kwong, and Stephen P. Luby. <www.nytimes.com>

“The bottom line is masks work, and higher quality masks most likely work better at preventing Covid-19.”

[7] Report: “Poverty in the United States: 2022.” By Emily A. Shrider and John Creamer. U.S. Census Bureau, September 2023. <www.census.gov>

Pages 20–21:

Table A-1. People in Poverty by Selected Characteristics: 2021 and 2022

2022 … Below poverty … Percent [=] 11.5 … Margin of error1 (±) [=] 0.3 …

1 A margin of error (MOE) is a measure of an estimate’s variability. The larger the MOE in relation to the size of the estimate, the less reliable the estimate. This number, when added to and subtracted from the estimate, forms the 90 percent confidence interval. MOEs shown in this table are based on standard errors calculated using replicate weights.

[8] Paper: “Comparison of Health and Health Risk Factors Between Lesbian, Gay, and Bisexual Adults and Heterosexual Adults in the United States.” By Gilbert Gonzales, Julia Przedworski, and Carrie Henning-Smith. Journal of the American Medical Association, June 27, 2016. <archinte.jamanetwork.com>

Data from the nationally representative 2013 and 2014 National Health Interview Survey were used to compare health outcomes among lesbian (n = 525), gay (n = 624), and bisexual (n = 515) adults who were 18 years or older and their heterosexual peers (n = 67 150) using logistic regression. …

After controlling for sociodemographic characteristics … bisexual men were more likely to report severe psychological distress (OR, 4.70; 95% CI, 1.77-12.52), heavy drinking (OR, 3.15; 95% CI, 1.22-8.16), and heavy smoking (OR, 2.10; 95% CI, 1.08-4.10) than heterosexual men….

[9] Paper: “Comparison of Health and Health Risk Factors Between Lesbian, Gay, and Bisexual Adults and Heterosexual Adults in the United States.” By Gilbert Gonzales, Julia Przedworski, and Carrie Henning-Smith. Journal of the American Medical Association, June 27, 2016. <archinte.jamanetwork.com>

Data from the nationally representative 2013 and 2014 National Health Interview Survey were used to compare health outcomes among lesbian (n = 525), gay (n = 624), and bisexual (n = 515) adults who were 18 years or older and their heterosexual peers (n = 67 150) using logistic regression. …

After controlling for sociodemographic characteristics … bisexual men were more likely to report severe psychological distress (OR, 4.70; 95% CI, 1.77-12.52), heavy drinking (OR, 3.15; 95% CI, 1.22-8.16), and heavy smoking (OR, 2.10; 95% CI, 1.08-4.10) than heterosexual men….

[10] Article: “The Myth of Margin of Error.” By Jeffrey Henning. Researchscape, October 13, 2017. <researchscape.com>

The margin of sampling error is widely reported in public opinion surveys because it is the only error that can be easily calculated. …

In fact, many researchers will just “do the math” to calculate sampling error, ignoring the fact that the assumptions behind the calculation aren’t being met.

[11] Article: “Iowa Poll: Kamala Harris Leapfrogs Donald Trump to Take Lead Near Election Day. Here’s How.” By Brianne Pfannenstiel. Des Moines Register, November 2, 2024. Updated November 7, 2024. <www.desmoinesregister.com>

A new Des Moines Register/Mediacom Iowa Poll shows Vice President Harris leading former President Trump 47% to 44% among likely voters just days before a high-stakes election that appears deadlocked in key battleground states. …

The poll of 808 likely Iowa voters, which include those who have already voted as well as those who say they definitely plan to vote, was conducted by Selzer & Co. from Oct. 28-31. It has a margin of error of plus or minus 3.4 percentage points. …

Questions based on the sample of 808 Iowa likely voters have a maximum margin of error of plus or minus 3.4 percentage points. This means that if this survey were repeated using the same questions and the same methodology, 19 times out of 20, the findings would not vary from the true population value by more than plus or minus 3.4 percentage points.

NOTES:

  • Trump won Iowa by 13.2 percentage points, receiving 55.7% of the vote as compared to 42.5% for Harris.
  • The “maximum margin of error” reported in this article was only the sampling margin of error, as documented in the footnote above.

[12] Post: “Significant Marginal Effects but C.I.S for Predicted Margins Overlapping.” By Dr. Clyde Schechter (Albert Einstein College of Medicine). Statalist, October 10, 2017. <www.statalist.org>

If one of the goals is to assess the predicted margins, then they should be presented with confidence intervals because every estimate should always be given with an estimate of the associated uncertainty. (The confidence interval represents a bare minimum estimate of the uncertainty of any estimate in that it accounts only for sampling error, but it is better than nothing.)

[13] Article: “Handling Missing Within-Study Correlations in the Evaluation of Surrogate Endpoints.” By Willem Collier and others. Statistics in Medicine, September 3, 2003. <pmc.ncbi.nlm.nih.gov>

To reduce bias in measures of the performance of the surrogate, the statistical model must account for the sampling error in each trial’s estimated treatment effects and their potential correlation.

A weighted least squares (WLS) approach is also frequently used…. The WLS method accounts only for sampling error of estimated effects on the clinical endpoint.

[14] Paper: “Measuring Coverage in MNCH: Total Survey Error and the Interpretation of Intervention Coverage Estimates from Household Surveys.” PLoS Medicine. May 7, 2013. <pmc.ncbi.nlm.nih.gov>

Nationally representative household surveys are increasingly relied upon to measure maternal, newborn, and child health (MNCH) intervention coverage at the population level in low- and middle-income countries. Surveys are the best tool we have for this purpose and are central to national and global decision making. However, all survey point estimates have a certain level of error (total survey error) comprising sampling and non-sampling error, both of which must be considered when interpreting survey results for decision making. … Sampling error is usually thought of as the precision of a point estimate and is represented by 95% confidence intervals, which are measurable. … By contrast, the direction and magnitude of non-sampling error is almost always unmeasurable, and therefore unknown.

[15] Report: “2023 Crime in the United States.” Federal Bureau of Investigation, September 2024. <www.justfacts.com>

Page 39 (of the PDF):

BJS [Bureau of Justice Statistics] derives the NCVS [National Crime Victimization Survey] estimates from interviewing a sample. The estimates are subject to a margin of error. This error is known and is reflected in the standard error of the estimate.

NOTE: As documented in the footnote above, the “margin of error” in this survey only accounts for the sampling margin of error.

[16] Book: Statistics for K–8 Educators. By Robert Rosenfeld. Routledge, 2013.

Page 92:

In general, larger random samples will produce smaller margins of error. However, in the real world of research where a study takes time and costs money, at a certain point you just can’t afford to increase the sample size. Your study will take too long or you may decide the increase in precision isn’t worth the expense. For instance, if you increase the sample size from 1,000 to 4,000 the margin of error will drop from about 3% to about 2%, but you might quadruple the cost of your survey.

[17] Paper: “Measuring Coverage in MNCH: Total Survey Error and the Interpretation of Intervention Coverage Estimates from Household Surveys.” PLoS Medicine. May 7, 2013. <pmc.ncbi.nlm.nih.gov>

Sampling error is usually thought of as the precision of a point estimate and is represented by 95% confidence intervals, which are measurable. … By contrast, the direction and magnitude of non-sampling error is almost always unmeasurable, and therefore unknown.

[18] Post: “Significant Marginal Effects but C.I.S for Predicted Margins Overlapping.” By Dr. Clyde Schechter (Albert Einstein College of Medicine). Statalist, October 10, 2017. <www.statalist.org>

If one of the goals is to assess the predicted margins, then they should be presented with confidence intervals because every estimate should always be given with an estimate of the associated uncertainty. (The confidence interval represents a bare minimum estimate of the uncertainty of any estimate in that it accounts only for sampling error, but it is better than nothing.)

[19] Report: “How Crime in the United States Is Measured.” Congressional Research Service, January 3, 2008. <crsreports.congress.gov>

Pages 26–27:

Because the NCVS [National Crime Victimization Survey] is a sample survey, it is subject to both sampling and non-sampling error, meaning that the estimated victimization rate might not accurately reflect the true victimization rate. Whenever samples are used to represent entire populations, there could be a discrepancy between the sample estimate and the true value of what the sample is trying to estimate. …

The NCVS is also subject to non-sampling error. The methodology employed by the NCVS attempts to reduce the effects of non-sampling error as much as possible, but an unquantified amount remains.242

[20] Report: “Estimating the Incidence of Rape and Sexual Assault.” Edited by Candace Kruttschnitt, William D. Kalsbeek, and Carol C. House. National Academy of Sciences, National Research Council, 2014. <nap.nationalacademies.org>

Page 4:

All surveys are subject to errors, and the NCVS [National Crime Victimization Survey] is no exception. An assessment of the errors and potential errors in a survey is important to understanding the overall quality of the estimates from that survey and to initiate improvements. Total survey error is a concept that involves a holistic view of all potential errors in a survey program, including both sampling error and various forms of nonsampling error.

[21] Report: “How Crime in the United States Is Measured.” Congressional Research Service, January 3, 2008. <crsreports.congress.gov>

Page 26:

Because the NCVS [National Crime Victimization Survey] is a sample survey, it is subject to both sampling and non-sampling error, meaning that the estimated victimization rate might not accurately reflect the true victimization rate. Whenever samples are used to represent entire populations, there could be a discrepancy between the sample estimate and the true value of what the sample is trying to estimate. The NCVS accounts for sampling error by calculating confidence intervals for estimated rates of victimization.238 For example, in 2000, the estimated violent crime victimization rate was 27.9 victimizations per 100,000 people aged 12 and older.239 The calculated 95% confidence interval240 for the estimated violent crime victimization rate was 25.85 to 29.95 victimizations per 100,000 people aged 12 and older.241

[22] Paper: “The Correct Interpretation of Confidence Intervals.” By Sze Huey Tan and Say Beng Tan. Proceedings of Singapore Healthcare, 2010. <journals.sagepub.com>

Page 277:

A common misunderstanding about CIs is that for say a 95% CI (A to B), there is a 95% probability that the true population mean lies between A and B. This is an incorrect interpretation of 95% CI because the true population mean is a fixed unknown value that is either inside or outside the CI with 100% certainty. As an example, let us assume that we know that the true population mean systolic blood pressure and it is 120mmHg. A study conducted gave us a mean systolic blood pressure of 105mmHg with a 95% CI of (95.5 to 118.9 mmHg). Knowing that the true population mean is 120mmHg it would be incorrect to say that there is a 95% probability that the true population mean lies in the 95% CI of (95.5 to 118.9mmHg) because we are certain that the 95% CI calculated did not contain the true population mean. A 95% CI simply means that if the study is conducted multiple times (multiple sampling from the same population) with corresponding 95% CI for the mean constructed, we expect 95% of these CIs to contain the true population mean

[23] Article: “What Does a Confidence Interval Mean?” By Allen B. Downey (Ph.D.), 2023. <allendowney.github.io>

Here’s a question from the Reddit statistics forum (with an edit for clarity):

Why does a confidence interval not tell you that 90% of the time, [the true value of the population parameter] will be in the interval, or something along those lines?

I understand that the interpretation of confidence intervals is that with repeated samples from the population, 90% of the time the interval would contain the true value of whatever it is you’re estimating. What I don’t understand is why this method doesn’t really tell you anything about what that parameter value is.

This is, to put it mildly, a common source of confusion. And here is one of the responses:

From a frequentist perspective, the true value of the parameter is fixed. Thus, once you have calculated your confidence interval, one if two things are true: either the true parameter value is inside the interval, or it is outside it. So the probability that the interval contains the true value is either 0 or 1, but you can never know which.

This response is the conventional answer to this question—it is what you find in most textbooks and what is taught in most classes. And, in my opinion, it is wrong. To explain why, I’ll start with a story.

Suppose Frank and Betsy visit a factory where 90% of the widgets are good and 10% are defective. Frank chooses a part at random and asks Betsy, “What is the probability that this part is good?”

Betsy says, “If 90% of the parts are good, and you choose one at random, the probability is 90% that it is good.”

“Wrong!” says Frank. “Since the part has already been manufactured, one of two things must be true: either it is good or it is defective. So the probability is either 100% or 0%, but we don’t know which.”

Frank’s argument is based on a strict interpretation of frequentism, which is a particular philosophy of probability. But it is not the only interpretation, and it is not a particularly good one. In fact, it suffers from several flaws. This example shows one of them—in many real-world scenarios where it would be meaningful and useful to assign a probability to a proposition, frequentism simply refuses to do so.

Fortunately, Betsy is under no obligation to adopt Frank’s interpretation of probability. She is free to adopt any of several alternatives that are consistent with her commonsense claim that a randomly-chosen part has a 90% probability of being functional. …

Suppose that Frank is a statistics teacher and Betsy is one of his students. …

Now suppose Frank asks, “What is the probability that this CI contains the actual value of μ that I chose?”

Betsy says, “We have established that 90% of the CIs generated by this process contain μ, so the probability that this CI contains is 90%.”

And of course Frank says “Wrong! Now that we have computed the CI, it is unknown whether it contains the true parameter, but it is not random. The probability that it contains μ is either 100% or 0%. We can’t say it has a 90% chance of containing μ.”

Once again, Frank is asserting a particular interpretation of probability—one that has the regrettable property of rendering probability nearly useless. Fortunately, Betsy is under no obligation to join Frank’s cult.

Under most reasonable interpretations of probability, you can say that a specific 90% CI has a 90% chance of containing the true parameter. There is no real philosophical problem with that.

[24] Book: Statistics for K–8 Educators. By Robert Rosenfeld. Routledge, 2013.

Page 92:

In general, larger random samples will produce smaller margins of error. However, in the real world of research where a study takes time and costs money, at a certain point you just can’t afford to increase the sample size. Your study will take too long or you may decide the increase in precision isn’t worth the expense. For instance, if you increase the sample size from 1,000 to 4,000 the margin of error will drop from about 3% to about 2%, but you might quadruple the cost of your survey.

[25] Report: “Drug Use, Dependence, and Abuse Among State Prisoners and Jail Inmates, 2007–2009.” By Jennifer Bronson and others. U.S. Department of Justice, Bureau of Justice Statistics, June 2017. <bjs.ojp.gov>

Page 19:

Standard errors and tests of significance

As with any survey, the NIS [National Inmate Surveys] estimates are subject to error arising from their basis on a sample rather than a complete enumeration of the population of adult inmates in prisons and jails. …

A common way to express this sampling variability is to construct a 95% confidence interval around each survey estimate.

[26] Paper: “Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects.” By Michael L. Anderson. Journal of the American Statistical Association, December 2008. Pages 1481–1495. <are.berkeley.edu>

Page 1481:

This article focuses on the three prominent early intervention experiments: the Abecedarian Project, the Perry Preschool Program, and the Early Training Project. …

But serious statistical inference problems affect these studies. The experimental samples are very small, ranging from approximately 60 to 120. Statistical power is therefore limited, and the results of conventional tests based on asymptotic theory may be misleading. More importantly, the large number of measured outcomes raises concerns about multiple inference: Significant coefficients may emerge simply by chance, even if there are no treatment effects. This problem is well known in the theoretical literature … and the biostatistics field … but has received limited attention in the policy evaluation literature. These issues—combined with a puzzling pattern of results in which early test score gains disappear within a few years and are followed a decade later by significant effects on adult outcomes—have created serious doubts about the validity of the results….

Page 1484:

[M]ost randomized evaluations in the social sciences test many outcomes but fail to apply any type of multiple inference correction. To gauge the extent of the problem, we conducted a survey of randomized evaluation works published from 2004 to 2006 in the fields of economic or employment policy, education, criminology, political science or public opinion, and child or adolescent welfare. Using the CSA Illumina social sciences databases, we identified 44 such articles in peer-reviewed journals. …

Nevertheless, only 3 works (7%) implemented any type of multiple-inference correction. … Although multiple-inference corrections are standard (and often mandatory) in psychological research … they remain uncommon in other social sciences, perhaps because practitioners in these fields are unfamiliar with the techniques or because they have seen no evidence that they yield more robust conclusions.

Pages 1493–1494:

As a final demonstration of the value of correcting for multiple inference, we conduct a stand-alone reanalysis of the Perry Preschool Project, arguably the most influential of the three experiments. …

[A] conventional research design [i.e., one that does not account for multiple inference problems] … adds eight more significant or marginally significant outcomes: female adult arrests, female employment, male monthly income, female government transfers, female special education rates, male drug use (in the adverse direction), male employment, and female monthly income. Of these eight outcomes, two (male and female monthly income) are not included in the other two studies [Abecedarian and Early Training]. The remaining six fail to replicate in either of the other studies. …

[Previous] researchers have emphasized the subset of unadjusted significant outcomes rather than applying a statistical framework that is robust to problems of multiple inference. …

Many studies in this field test dozens of outcomes and focus on the subset of results that achieve significance.

[27] Paper: “HARKing, Cherry-Picking, P-Hacking, Fishing Expeditions, and Data Dredging and Mining as Questionable Research Practices.” Journal of Clinical Psychiatry, February 18, 2021. <www.psychiatrist.com>

P-hacking is a QRP [questionable research practice] wherein a researcher persistently analyzes the data, in different ways, until a statistically significant outcome is obtained; the purpose is not to test a hypothesis but to obtain a significant result. Thus, the researcher may experiment with different statistical approaches to test a hypothesis; or may include or exclude covariates; or may experiment with different cutoff values; or may split groups or combine groups; or may study different subgroups; and the analysis stops either when a significant result is obtained or when the researcher runs out of options. The researcher then reports only the approach that led to the desired result.3,8

[28] Paper: “Big Little Lies: A Compendium and Simulation of p-Hacking Strategies.” By Angelika M. Stefan and Felix D. Schönbrodt. Royal Society Open Science, February 2023. <royalsocietypublishing.org>

In an academic system that promotes a ‘publish or perish’ culture, researchers are incentivized to exploit degrees of freedom in their design, analysis and reporting practices to obtain publishable outcomes [1]. In many empirical research fields, the widespread use of such questionable research practices has damaged the credibility of research results [2–5]. Ranging in the grey area between good practice and outright scientific misconduct, questionable research practices are often difficult to detect, and researchers are often not fully aware of their consequences [6–8].

One of the most prominent questionable research practices is p-hacking [4,9]. Researchers engage in p-hacking in the context of frequentist hypothesis testing, where the p-value determines the test decision. If the p-value is below a certain threshold α, it is labelled ‘significant’, and the null hypothesis can be rejected. In this paper, we define p-hacking broadly as any measure that a researcher applies to render a previously non-significant p-value significant.

p-hacking was first described by De Groot [10] as a problem of multiple testing and selective reporting. The term ‘p-hacking’ appeared shortly after the onset of the replication crisis [9,11], and the practice has since been discussed as one of the driving factors of false-positive results in the social sciences and beyond [12–14]. Essentially, p-hacking exploits the problem of multiplicity, that is, α-error accumulation due to multiple testing [15]. Specifically, the probability to make at least one false-positive test decision increases as more hypothesis tests are conducted [16,17]. When researchers engage in p-hacking, they conduct multiple hypothesis tests without correcting for the α-error accumulation, and report only significant results from the group of tests. This practice dramatically increases the percentage of false-positive results in the published literature [18].

[29] Article: “In Research, What Does A ‘Significant Effect’ Mean?” By Matthew Di Carlo (PhD). Albert Shanker Institute, November 1, 2011. <www.shankerinstitute.org>

Then there’s the term “significant.” “Significant” is of course a truncated form of “statistically significant.” Statistical significance means we can be confident that a given relationship is not zero. That is, the relationship or difference is probably not just random “noise.” A significant effect can be either positive (we can be confident it’s greater than zero) or negative (we can be confident it’s less than zero). In other words, it is “significant” insofar as it’s not nothing. The better way to think about it is “discernible.” There’s something there.

[30] Paper: “Effectiveness of Adding a Mask Recommendation to Other Public Health Measures to Prevent SARS-CoV-2 Infection in Danish Mask Wearers.” By Henning Bundgaard and others. Annals of Internal Medicine, November 18, 2020. <www.acpjournals.org>

“Although the difference observed was not statistically significant, the 95% CIs [confidence intervals] are compatible with a 46% reduction to a 23% increase in infection.”

[31] Report: “Drug Use, Dependence, and Abuse Among State Prisoners and Jail Inmates, 2007–2009.” By Jennifer Bronson and others. U.S. Department of Justice, Bureau of Justice Statistics, June 2017. <bjs.ojp.gov>

Page 19:

Standard errors and tests of significance

As with any survey, the NIS [National Inmate Survey] estimates are subject to error arising from their basis on a sample rather than a complete enumeration of the population of adult inmates in prisons and jails. …

A common way to express this sampling variability is to construct a 95% confidence interval around each survey estimate.

[32] Paper: “School Vouchers and Student Outcomes: Experimental Evidence from Washington, DC.” By Patrick J. Wolf and others. Journal of Policy Analysis and Management, Spring 2013. Pages 246–270. <onlinelibrary.wiley.com>

Page 258: “Results are described as statistically significant or highly statistically significant if they reach the 95 percent or 99 percent confidence level, respectively.”

[33] Article: “Statistical Significance.” By Michael McDonough. Encyclopaedia Britannica. Last updated October 15, 2024. <www.britannica.com/topic/statistical-significance>

“Statistical significance implies that an observed result is not due to sampling error.”

[34] Article: “In Research, What Does A ‘Significant Effect’ Mean?” By Matthew Di Carlo (PhD). Albert Shanker Institute, November 1, 2011. <www.shankerinstitute.org>

Then there’s the term “significant.” “Significant” is of course a truncated form of “statistically significant.” Statistical significance means we can be confident that a given relationship is not zero. That is, the relationship or difference is probably not just random “noise.” A significant effect can be either positive (we can be confident it’s greater than zero) or negative (we can be confident it’s less than zero). In other words, it is “significant” insofar as it’s not nothing. The better way to think about it is “discernible.” There’s something there.

[35] Paper: “Relative Plasma Volume Monitoring During Hemodialysis Aids the Assessment of Dry-Weight.” By Arjun D Sinha, Robert P Light, and Rajiv Agarwal. Hypertension, December 28, 2009. <pmc.ncbi.nlm.nih.gov>

“Mean changes and their 95% confidence intervals are shown. If the confidence interval crosses zero, the mean is statistically insignificant at the 5% level.”

[36] Paper: “Insignificant Effect of Arctic Amplification on the Amplitude of Midlatitude Atmospheric Waves.” By Russell Blackport and James A Screen. Science Advances, February 19, 2020. <pmc.ncbi.nlm.nih.gov>

“In all cases, the spread of the modeled LWA [local wave activity] trends crosses zero, consistent with the statistically insignificant observed multidecadal trends.”

[37] Paper: “Effectiveness of Adding a Mask Recommendation to Other Public Health Measures to Prevent SARS-CoV-2 Infection in Danish Mask Wearers.” By Henning Bundgaard and others. Annals of Internal Medicine, November 18, 2020. <www.acpjournals.org>

“Although the difference observed was not statistically significant, the 95% CIs [confidence intervals] are compatible with a 46% reduction to a 23% increase in infection.”

[38] Paper: “Effectiveness of Adding a Mask Recommendation to Other Public Health Measures to Prevent SARS-CoV-2 Infection in Danish Mask Wearers.” By Henning Bundgaard and others. Annals of Internal Medicine, November 18, 2020. <www.acpjournals.org>

“Although the difference observed was not statistically significant, the 95% CIs [confidence intervals] are compatible with a 46% reduction to a 23% increase in infection.”

[39] Paper: “A Review of High Impact Journals Found That Misinterpretation of Non-Statistically Significant Results From Randomized Trials Was Common.” By Karla Hemming, Iqra Javid, and Monica Taljaard. Journal of Clinical Epidemiology, May 2022. <www.sciencedirect.com>

The first and most problematic issue is when inconclusive trials are interpreted as providing definitive evidence that the treatment under evaluation is ineffective [10]. This is referred to as conflating no evidence of a difference with evidence of no difference (i.e., conflating absence of evidence with evidence of absence) [1]. …

Almost all abstracts of RCTs [randomized controlled trials] published in high impact journals with non-statistically significant primary outcomes appropriately report treatment effects and confidence intervals, yet most make definitive conclusions about active treatments being no different to the comparator treatment, despite this being prima facia [at first sight] inconsistent with a non-statistically significant primary outcome result. … In addition, a large number of studies unhelpfully provide no informative interpretation: in the overall conclusion they simply state that the result is non-statistically significant, despite having reported confidence intervals in the results section. … Clear statements that the study finding is inconclusive (i.e., when the confidence interval provides support for both benefit and harm) in reports of RCTs in high impact journals are rare. Despite high profile campaigns in 2016 to put a stop to this poor practice [38], our review demonstrates that the practice of misinterpretation is still highly prevalent. …

Thus, it might be possible that some studies which reported an overall interpretation of no difference between the two treatment arms were correct in this interpretation: some of these associated confidence intervals might well have excluded clinically important differences, although this was not transparent in the abstract [21].

[40] Article: “The Most Objective Evidence Shows No Indication That Covid Vaccines Save More Lives Than They Take. By James D. Agresti. Just Facts, March 2, 2022. <www.justfactsdaily.com>

In this case, the “intervention” is FDA-approved Covid vaccines, and the “outcome” is death. That vital data was gathered in RCTs involving 72,663 adults and older children for the Moderna and Pfizer vaccines. However, the FDA presented these results in a place and manner likely to be overlooked, and no major media outlet has covered them.

The results reveal that 70 people died during the Moderna and Pfizer trials, including 37 who received Covid vaccines and 33 who did not. Combined with the fact that half of the study participants were given vaccinations and the other half were given placebos, these crucial results provide no indication that the vaccines save more lives than they take.

Accounting for sampling margins of error—as is common for medical journals and uncommon for the media—the results demonstrate with 95% confidence that:

• neither of the vaccines decreased or increased the absolute risk of death by any more than 0.08% over the course of the trials.

• the vaccines could prevent up to two deaths or cause up to three deaths per year among every 1,000 people.

[41] Book: Multiple Regression: A Primer. By Paul D. Allison. Pine Forge Press, 1998.

Chapter 3: “What Can Go Wrong With Multiple Regression?” <us.sagepub.com>

Pages 57–58:

Sample size has a profound effect on tests of statistical significance. With a sample of 60 people, a correlation has to be at least .25 (in magnitude) to be significantly different from zero (at the .05 level). With a sample of 10,000 people, any correlation larger than .02 will be statistically significant. The reason is simple: There’s very little information in a small sample, so estimates of correlations are very unreliable. If we get a correlation of .20, there may still be a good chance that the true correlation is zero. …

Statisticians often describe small samples as having low power to test hypotheses. There is another, entirely different problem with small samples that is frequently confused with the issue of power. Most of the test statistics that researchers use (such as t tests, F tests, and chi-square tests) are only approximations. These approximations are usually quite good when the sample is large but may deteriorate markedly when the sample is small. That means that p values calculated for small samples may be only rough approximations of the true p values. If the calculated p value is .02, the true value might be something like .08. …

That brings us to the inevitable question: What’s a big sample and what’s a small sample? As you may have guessed, there’s no clear-cut dividing line. Almost anyone would consider a sample less than 60 to be small, and virtually everyone would agree that a sample of 1,000 or more is large. In between, it depends on a lot of factors that are difficult to quantify, at least in practice.

[42] Article: “Regulatory Scientists Are Quiet About EUA, Kids Vax, Paxlovid and Boosters.” By Dr. Vinay Prasad. <vinayprasadmdmph.substack.com>

Many scientists made a career fighting for better regulatory standards. Strangely, when it comes to the regulatory policy around COVID-19, they are dead quiet. …

Regulatory experts have told us for year[s] that if outcomes are generally favorable, you need a very large randomized control trial to show a benefit. …

… Boosting 20-year-olds should not come under the auspices of an EUA [emergency use authorization]. You should do a very large randomized trial to show it has a benefit. And if you can’t run the trial because the sample size is too large that tells you something about how marginal the effect size is.

[43] Article: “FDA Violated Own Safety and Efficacy Standards in Approving Covid-19 Vaccines For Children.” By James D. Agresti. Just Facts, July 14, 2022. <www.justfactsdaily.com>

That doesn’t mean the vaccine doesn’t work, but there is no way to be sure. This is because the study was underpowered, a medical term for clinical trials that don’t enroll enough participants to detect important effects. Beyond severe Covid and hospitalizations for it, the Pfizer and Moderna trials were also too underpowered to measure:

• overall hospitalizations, which are far more informative than hospitalizations for Covid because they also measure the side effects of the vaccines.

• all-cause mortality, which is the only objective way to be certain the vaccines save more lives than they take.

To determine the last of those measures with 95% confidence would require a trial with more than half a billion children for a full year. And that assumes the vaccine works flawlessly by preventing all Covid deaths and causing no deaths from side effects. This astronomically large number is needed because deaths from Covid-19 are extremely rare among children, amounting to about one out of every 500,000 children in the first year of pandemic. In fact, children are about 36 times more likely to die of accidents than Covid-19.

Microscopically smaller than an adequate study, the Moderna vaccine trials for children aged 6 months to 5 years included a total of 6,388 children with a median blinded follow-up time of 68–71 days after the second dose. The Pfizer trial was similarly sized.

Comparing the data above, the trials that were conducted would need to be about 400,000 times larger/longer to objectively determine if the vaccines save more toddlers and preschoolers than they kill.

[44] Commentary: “Scientists Rise Up Against Statistical Significance.” By Valentin Amrhein, Sander Greenland, and Blake McShane. Nature, March 20, 2019. <www.nature.com>

In 2016, the American Statistical Association released a statement in The American Statistician warning against the misuse of statistical significance and P values. The issue also included many commentaries on the subject. This month, a special issue in the same journal attempts to push these reforms further. It presents more than 40 papers on ‘Statistical inference in the 21st century: a world beyond P < 0.05’. The editors introduce the collection with the caution “don’t say ‘statistically significant’”3. Another article4 with dozens of signatories also calls on authors and journal editors to disavow those terms.

We agree, and call for the entire concept of statistical significance to be abandoned.

We are far from alone. When we invited others to read a draft of this comment and sign their names if they concurred with our message, 250 did so within the first 24 hours. A week later, we had more than 800 signatories—all checked for an academic affiliation or other indication of present or past work in a field that depends on statistical modelling….

[45] Article: “Statistical Significance.” By Michael McDonough. Encyclopaedia Britannica. Last updated October 15, 2024. <www.britannica.com>

Since its conception in the 18th century, statistical significance has become the gold standard for establishing the validity of a result. Statistical significance does not imply the size, importance, or practicality of an outcome; it simply indicates that the outcome’s difference from a baseline is not due to chance. …

A growing number of researchers have voiced concerns over the misinterpretation of, and overreliance on, statistical significance. Often, analysis ends once an observation has been deemed to be statistically significant, and the observation is treated as evidence of an effect.

[46] Paper: “A Review of High Impact Journals Found That Misinterpretation of Non-Statistically Significant Results From Randomized Trials Was Common.” By Karla Hemming, Iqra Javid, and Monica Taljaard. Journal of Clinical Epidemiology, May 2022. <www.sciencedirect.com>

The first and most problematic issue is when inconclusive trials are interpreted as providing definitive evidence that the treatment under evaluation is ineffective [10]. This is referred to as conflating no evidence of a difference with evidence of no difference (i.e., conflating absence of evidence with evidence of absence) [1]. …

Almost all abstracts of RCTs [randomized controlled trials] published in high impact journals with non-statistically significant primary outcomes appropriately report treatment effects and confidence intervals, yet most make definitive conclusions about active treatments being no different to the comparator treatment, despite this being prima facia [at first sight] inconsistent with a non-statistically significant primary outcome result. … In addition, a large number of studies unhelpfully provide no informative interpretation: in the overall conclusion they simply state that the result is non-statistically significant, despite having reported confidence intervals in the results section. … Clear statements that the study finding is inconclusive (i.e., when the confidence interval provides support for both benefit and harm) in reports of RCTs in high impact journals are rare. Despite high profile campaigns in 2016 to put a stop to this poor practice [38], our review demonstrates that the practice of misinterpretation is still highly prevalent. …

Thus, it might be possible that some studies which reported an overall interpretation of no difference between the two treatment arms were correct in this interpretation: some of these associated confidence intervals might well have excluded clinically important differences, although this was not transparent in the abstract [21].

[47] Textbook: Statistics: Concepts and Controversies (6th edition). By David S. Moore and William I. Notz. W. H. Freeman and Company, 2006.

Page 42: “It is usual to report the margin of error for 95% confidence. If a news report gives a margin of error but leaves out the confidence level, it’s pretty safe to assume 95% confidence.”

[48] Book: Statistics for K–8 Educators. By Robert Rosenfeld. Routledge, 2013.

Page 91:

Why 95%? Why not some other percentage? This value gives a level of confidence that has been found convenient and practical for summarizing survey results. There is nothing inherently special about it. If you are willing to change from 95% to some other level of confidence, and consequently change the chances that your poll results are off from the truth, you will therefore change the resulting margin of error. At present, 95% is just the level that is commonly used in a great variety of polls and research projects.

[49] Article: “Statistical Significance.” By Michael McDonough. Encyclopaedia Britannica. Last updated October 15, 2024. <www.britannica.com/topic/statistical-significance>

A growing number of researchers have voiced concerns over the misinterpretation of, and overreliance on, statistical significance. Often, analysis ends once an observation has been deemed to be statistically significant, and the observation is treated as evidence of an effect. This tendency is especially problematic given that statistical significance is not equal to clinical significance, a measure of effect size and practical importance. In an experiment, a statistically significant result simply indicates that a difference exists between two groups. This difference might be incredibly small, but, without further testing, its practical impact is unknown.

[50] Paper: “Beyond Statistical Significance: Clinical Interpretation of Rehabilitation Research Literature.” By Phil Page. International Journal of Sports Physical Therapy, October 9, 2014. <pmc.ncbi.nlm.nih.gov>

While most research focus on statistical significance, clinicians and clinical researchers should focus on clinically significant changes. A study outcome can be statistically significant, but not be clinically significant, and vice‐versa. Unfortunately, clinical significance is not well defined or understood, and many research consumers mistakenly relate statistically significant outcomes with clinical relevance. Clinically relevant changes in outcomes are identified (sometimes interchangeably) by several similar terms including “minimal clinically important differences (MCID)”, “clinically meaningful differences (CMD)”, and “minimally important changes (MIC)”.

In general, these terms all refer to the smallest change in an outcome score that is considered “important” or “worthwhile” by the practitioner or the patient8 and/or would result in a change in patient management9,10. Changes in outcomes exceeding these minimal values are considered clinically relevant. It is important to consider that both harmful changes and beneficial changes may be outcomes of treatment; therefore, the term “clinically‐important changes” should be used to identify both minimal and beneficial differences, but also to recognize harmful changes.
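
NOTE: The following Python sketch (not from the cited paper) shows one way a reader might compare a result's confidence interval against a minimal clinically important difference. The effect estimate, interval, and MCID threshold are hypothetical.

def interpret(effect: float, ci_low: float, ci_high: float, mcid: float) -> str:
    """Classify a result by whether its interval clears a clinically important
    threshold, not merely by whether it excludes zero."""
    if ci_low >= mcid:
        return "clinically important benefit supported"
    if ci_high < mcid and ci_low > -mcid:
        return "any true effect is likely smaller than the MCID"
    return "inconclusive: compatible with both trivial and important effects"

# Hypothetical pain-score reduction of 1.1 points (95% CI, 0.2 to 2.0) against an MCID of 1.5:
print(interpret(effect=1.1, ci_low=0.2, ci_high=2.0, mcid=1.5))
# -> inconclusive: compatible with both trivial and important effects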

[51] Commentary: “Scientists Rise Up Against Statistical Significance.” By Valentin Amrhein, Sander Greenland, and Blake McShane. Nature, March 20, 2019. <www.nature.com>

How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see? For several generations, researchers have been warned that a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome)1. …

These and similar errors are widespread. Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating ‘no difference’ or ‘no effect’ in around half …

… Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.

For example, consider a series of analyses of unintended effects of anti-inflammatory drugs2. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was “not associated” with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.

Now, let’s look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).

It is ludicrous to conclude that the statistically non-significant results showed “no association,” when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect.
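
NOTE: The following Python sketch applies the standard log-scale approximation for recovering a two-sided p-value from a risk ratio and its 95% confidence interval. With the figures quoted above, it reproduces values close to the commentary's "our calculation" results.

from math import log
from statistics import NormalDist

def p_from_rr(rr: float, ci_low: float, ci_high: float) -> float:
    se = log(ci_high / ci_low) / (2 * 1.96)    # standard error of log(RR) implied by the 95% CI
    z = log(rr) / se                           # test statistic against a risk ratio of 1
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(p_from_rr(1.2, 0.97, 1.48))   # roughly 0.09   -- the "non-significant" study
print(p_from_rr(1.2, 1.09, 1.33))   # roughly 0.0003 -- the earlier, more precise study
# Same observed risk ratio in both studies; only the precision differs.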

[52] Entry: “underpowered clinical trial.” Segen’s Medical Dictionary, 2012. <medical-dictionary.thefreedictionary.com>

“A clinical trial that has so few patients in each arm that the results will fall short of the statistical power needed to provide valid answers.”
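
NOTE: The following Python sketch (not from the dictionary entry) is an approximate power calculation for a two-arm trial comparing event rates, using the normal approximation. The assumed rates and sample sizes are hypothetical; they only illustrate how a small trial can be underpowered to detect a real difference.

from math import sqrt
from statistics import NormalDist

def approx_power(p_control: float, p_treated: float, n_per_arm: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-arm comparison of proportions (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p_control * (1 - p_control) / n_per_arm + p_treated * (1 - p_treated) / n_per_arm)
    return NormalDist().cdf(abs(p_treated - p_control) / se - z_alpha)

# A real difference of 10 percentage points (30% vs. 20% event rates):
print(f"{approx_power(0.30, 0.20, n_per_arm=50):.0%}")    # roughly 20% power -- underpowered
print(f"{approx_power(0.30, 0.20, n_per_arm=300):.0%}")   # roughly 80% power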

[53] Paper: “A Review of High Impact Journals Found That Misinterpretation of Non-Statistically Significant Results From Randomized Trials Was Common.” By Karla Hemming, Iqra Javid, and Monica Taljaard. Journal of Clinical Epidemiology, May 2022. <www.sciencedirect.com>

The first and most problematic issue is when inconclusive trials are interpreted as providing definitive evidence that the treatment under evaluation is ineffective [10]. This is referred to as conflating no evidence of a difference with evidence of no difference (i.e., conflating absence of evidence with evidence of absence) [1]. …

Almost all abstracts of RCTs [randomized controlled trials] published in high impact journals with non-statistically significant primary outcomes appropriately report treatment effects and confidence intervals, yet most make definitive conclusions about active treatments being no different to the comparator treatment, despite this being prima facia [at first sight] inconsistent with a non-statistically significant primary outcome result. … In addition, a large number of studies unhelpfully provide no informative interpretation: in the overall conclusion they simply state that the result is non-statistically significant, despite having reported confidence intervals in the results section. … Clear statements that the study finding is inconclusive (i.e., when the confidence interval provides support for both benefit and harm) in reports of RCTs in high impact journals are rare. Despite high profile campaigns in 2016 to put a stop to this poor practice [38], our review demonstrates that the practice of misinterpretation is still highly prevalent.

[54] Commentary: “Scientists Rise Up Against Statistical Significance.” By Valentin Amrhein, Sander Greenland, and Blake McShane. Nature, March 20, 2019. <www.nature.com>

Nor do statistically significant results ‘prove’ some other hypothesis. Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.

[55] Paper: “Sea-Level Rise for the Coasts of California, Oregon, and Washington: Past, Present, and Future.” By the Committee on Sea Level Rise in California, Oregon, and Washington, National Research Council. National Academies Press, 2012. <www.nap.edu>

Pages 3–4:

Projections of global sea-level rise are generally made using models of the ocean-atmosphere-climate system, extrapolations, or semi-empirical methods. Ocean-atmosphere models are based on knowledge of the physical processes that contribute to sea-level rise, and they predict the response of those processes to different scenarios of future greenhouse gas emissions. These models provide a reasonable estimate of the water density (steric) component of sea-level rise (primarily thermal expansion), but they underestimate the land ice contribution because they do not fully account for rapid changes in the behavior of ice sheets and glaciers as melting occurs (ice dynamics). The IPCC (2007) projections were made using this method, and they are likely too low, even with an added ice dynamics component. Estimates of the total land ice contribution can be made by extrapolating current observations of ice loss rates from glaciers, ice caps, and ice sheets into the future. Extrapolations of future ice melt are most reliable for time frames in which the dynamics controlling behavior are stable, in this case, up to several decades. Semi-empirical methods, exemplified by Vermeer and Rahmstorf (2009), avoid the difficulty of estimating the individual contributions to sea-level rise by simply postulating that sea level rises faster as the Earth gets warmer. This approach reproduces the sea-level rise observed in the past, but reaching the highest projections would require acceleration of glaciological processes to levels not previously observed or understood as realistic. …

Given the strengths and weaknesses of the different projection methods, as well as the resource constraints of an NRC study, the committee chose a combination of approaches for its projections. The committee projected the steric component of sea-level rise using output from global ocean models under an IPCC (2007) mid-range greenhouse gas emission scenario. The land ice component was extrapolated using the best available compilations of ice mass accumulation and loss (mass balance), which extend from 1960 to 2005 for glaciers and ice caps, and from 1992 to 2010 for the Greenland and Antarctic ice sheets. The contributions were then summed. The committee did not project the land hydrology contribution because available estimates suggested that the sum of groundwater extraction and reservoir storage is near zero, within large uncertainties.
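
NOTE: The following Python sketch is a heavily simplified illustration of the structure the committee describes: a modeled steric component added to a land-ice component extrapolated from observed loss rates. Every number in it is a hypothetical placeholder, not a figure from the study.

years = 2100 - 2000                     # projection horizon in years

steric_mm_per_yr = 2.0                  # placeholder thermal-expansion rate (from an ocean model)
ice_loss_mm_per_yr = 2.5                # placeholder land-ice contribution rate (from observations)
ice_accel_mm_per_yr2 = 0.04             # placeholder acceleration applied in the extrapolation

steric_component = steric_mm_per_yr * years
land_ice_component = ice_loss_mm_per_yr * years + 0.5 * ice_accel_mm_per_yr2 * years ** 2
total_mm = steric_component + land_ice_component

print(f"illustrative total: {total_mm / 25.4:.0f} inches by 2100")   # about 26 inches with these placeholders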

[56] Textbook: Flood Geomorphology. By Victor R. Baker and others. Wiley, April 1998.

Page ix:

[T]rue science is concerned with understanding nature no matter what the methodology. In our view, if the wrong equations are programmed because of inadequate understanding of the system, then what the computer will produce, if believed by the analyst, will constitute the opposite of science.

[57] Paper: “The Use and Misuse of Models for Climate Policy.” By Robert S. Pindyck. Review of Environmental Economics and Policy, March 11, 2017. <journals.uchicago.edu>

In a recent article (Pindyck 2013a), I argued that integrated assessment models (IAMs) “have crucial flaws that make them close to useless as tools for policy analysis” (page 860). In fact, I would argue that the problem goes beyond their “crucial flaws”: IAM-based analyses of climate policy create a perception of knowledge and precision that is illusory and can fool policymakers into thinking that the forecasts the models generate have some kind of scientific legitimacy. …

The argument is sometimes made that we have no choice—that without a model we will end up relying on biased opinions, guesswork, or even worse. Thus we must develop the best models possible and then use them to evaluate alternative policies. In other words, the argument is that working with even a highly imperfect model is better than having no model at all. This might be a valid argument if we were honest and up-front about the limitations of the model. But often we are not.

[58] Report: “Face Coverings in the Community and COVID-19: A Rapid Review.” Public Health England, June 26, 2020. <www.justfacts.com>

Page 6:

Part of the limitations of modelling studies is that they must make assumptions in cases where the evidence or data are lacking. For example, models used different parameters to define ‘effectiveness’ of masks, which ranged from an 8% (24) reduction in risk to >95% (29) reduction in risk. The nature of modelling studies also means that simulations are run in controlled environments that may not accurately reflect the behaviours that we observe in real life. Unless controlled for, parameters can be fixed that are usually variable.

Pages 7–8:

[M]odelling and laboratories studies provide only theoretical evidence…. We, therefore, cannot recommend the use of modelling studies alone as evidence to inform or change policy measures.
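
NOTE: The following Python sketch is a toy sensitivity check, not any model from the review. It only illustrates how strongly a modeled outcome can depend on the assumed mask-effectiveness parameter, which the review notes ranged from 8% to more than 95% across studies. The baseline reproduction number, the coverage figure, and the proportional-reduction assumption are all hypothetical.

r0 = 2.5          # hypothetical baseline reproduction number
coverage = 0.8    # hypothetical fraction of contacts where masks are worn

for effectiveness in (0.08, 0.50, 0.95):              # the review's low end, a midpoint, the high end
    r_eff = r0 * (1 - coverage * effectiveness)       # simple proportional-reduction assumption
    print(f"assumed effectiveness {effectiveness:.0%}: modeled R_eff = {r_eff:.2f}")
# 8%  -> 2.34 (epidemic modeled as still growing quickly)
# 95% -> 0.60 (epidemic modeled as dying out)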

[59] Commentary: “Five Ways to Ensure That Models Serve Society: A Manifesto.” By Andrea Saltelli and others. Nature, June 24, 2020. <www.nature.com>

Now, computer modelling is in the limelight, with politicians presenting their policies as dictated by ‘science’2. Yet there is no substantial aspect of this pandemic for which any researcher can currently provide precise, reliable numbers. Known unknowns include the prevalence and fatality and reproduction rates of the virus in populations. There are few estimates of the number of asymptomatic infections, and they are highly variable. We know even less about the seasonality of infections and how immunity works, not to mention the impact of social-distancing interventions in diverse, complex societies.

Mathematical models produce highly uncertain numbers that predict future infections, hospitalizations and deaths under various scenarios. Rather than using models to inform their understanding, political rivals often brandish them to support predetermined agendas. To make sure predictions do not become adjuncts to a political cause, modellers, decision makers and citizens need to establish new social norms. Modellers must not be permitted to project more certainty than their models deserve; and politicians must not be allowed to offload accountability to models of their choosing2,3.

[60] Paper: “Risk of Bias in Model-Based Economic Evaluations: The ECOBIAS Checklist.” By Charles Christian Adarkwah and others. Expert Review of Pharmacoeconomics & Outcomes Research, November 20, 2015. <www.researchgate.net>

Page 1:

Economic evaluations are becoming increasingly important in providing policymakers with information for reimbursement decisions. However, in many cases, there is a significant difference between theoretical study results and real-life observations. This can be due to confounding factors or many other variables, which could be significantly affected by bias. …

There are basically two analytical frameworks used to conduct economic evaluations: model-based and trial-based. In a model-based economic evaluation, data from a wide range of sources (e.g., randomized-controlled trials [RCTs], meta-analyses, observational studies) are combined using a mathematical model to represent the complexity of a healthcare process.

Page 6:

This study identified several additional biases related to model-based economic evaluation and showed that the impact of these biases could be massive, changing the outcomes from being highly cost-effective to not being cost-effective at all.
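
NOTE: The following Python sketch (not from the ECOBIAS paper) illustrates how a single biased input can flip a model-based cost-effectiveness conclusion. All costs, QALY gains, and the willingness-to-pay threshold are hypothetical.

def icer(extra_cost: float, extra_qalys: float) -> float:
    """Incremental cost-effectiveness ratio: extra cost per extra QALY gained."""
    return extra_cost / extra_qalys

threshold = 50_000          # hypothetical willingness-to-pay per QALY

# Same hypothetical treatment, two assumptions about the QALY gain fed into the model:
for label, qaly_gain in (("optimistic input", 0.40), ("corrected input", 0.15)):
    ratio = icer(extra_cost=12_000, extra_qalys=qaly_gain)
    verdict = "cost-effective" if ratio <= threshold else "not cost-effective"
    print(f"{label}: ${ratio:,.0f} per QALY -> {verdict}")
# optimistic input: $30,000 per QALY -> cost-effective
# corrected input: $80,000 per QALY -> not cost-effective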

[61] Paper: “Economic Evaluations in Fracture Research an Introduction with Examples of Foot Fractures.” By Noortje Anna Clasina van den Boom and others. Injury, March 2022. <www.sciencedirect.com>

The lack of reliable data in the field of economic evaluation fractures could be explained by the lack of reliable literature to base the models on. Since model based studies are the most common design in this field of research, this problem is significant.

[62] Paper: “Comparison of Health and Health Risk Factors Between Lesbian, Gay, and Bisexual Adults and Heterosexual Adults in the United States.” By Gilbert Gonzales, Julia Przedworski, and Carrie Henning-Smith. Journal of the American Medical Association, June 27, 2016. <archinte.jamanetwork.com>

Finally, the NHIS [National Health Interview Survey] is a cross-sectional survey and cannot definitively establish the causal directions of the observed associations because cross-sectional studies are prone to omitted variable bias. Missing and unmeasured variables—such as exposure to discrimination or nondisclosure of sexual orientation to family, friends, and health care professionals—may provide alternative explanations for the association between sexual orientation and health outcomes.

NOTE: See the next footnote, where the lead author of this study makes a causal inference about the study.

[63] Article: “Survey Finds Excess Health Problems in Lesbians, Gays, Bisexuals.” By Andrew M. Seaman. Reuters, June 28, 2016. <ca.news.yahoo.com>

Gilbert Gonzales of the Vanderbilt University School of Medicine in Nashville and colleagues found that compared to heterosexual women, lesbians were 91 percent more likely to report poor or fair health. Lesbians were 51 percent more likely, and bisexual women were more than twice as likely, to report multiple chronic conditions, compared to straight women. …

Gonzales told Reuters Health that the health disparities are likely due to the stress of being a minority, which is likely exacerbated among bisexual people, who may not be accepted by lesbian, gay, bisexual and transgender communities.

[64] Paper: “Association Is Not Causation: Treatment Effects Cannot Be Estimated From Observational Data in Heart Failure.” By Christopher J Rush and others. European Heart Journal, October 2018. <academic.oup.com>

This comprehensive comparison of studies of non-randomized data with the findings of RCTs [randomized controlled trials] in HF [heart failure] shows that it is not possible to make reliable therapeutic inferences from observational associations.

[65] Textbook: Principles and Practice of Clinical Research. By John I. Gallin and ‎Frederick P. Ognibene. Academic Press, 2012.

Page 226: “While consistency in the findings of a large number of observational studies can lead to the belief that the associations are causal, this belief is a fallacy.”

[66] Book: Introductory Econometrics: Using Monte Carlo Simulation with Microsoft Excel. By Humberto Barreto and Frank M. Howland. Cambridge University Press, 2006.

Page 491:

Omitted variable bias is a crucial topic because almost every study in econometrics is an observational study as opposed to a controlled experiment. Very often, economists would like to be able to interpret the comparisons they make as if they were the outcomes of controlled experiments. In a properly conducted controlled experiment, the only systematic difference between groups results from the treatment under investigation; all other variation stems from chance. In an observational study, because the participants self-select into groups, it is always possible that varying average outcomes between groups result from systematic difference between groups other than the treatment. We can attempt to control for these systematic differences by explicitly incorporating variables in a regression. Unfortunately, if not all of those differences have been controlled for in the analysis, we are vulnerable to the devastating effects of omitted variable bias.
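
NOTE: The following Python sketch (not from the cited book) simulates omitted variable bias. A confounder drives both self-selection into "treatment" and the outcome, so the naive group comparison shows a large spurious effect, while a regression that includes the confounder recovers the true effect of zero. All data are simulated.

import numpy as np

rng = np.random.default_rng(0)
n = 50_000

confounder = rng.normal(size=n)                                     # e.g., underlying health
treatment = (confounder + rng.normal(size=n) > 0).astype(float)     # self-selection into treatment
outcome = 2.0 * confounder + rng.normal(size=n)                     # true treatment effect is zero

# Naive comparison (confounder omitted): difference in mean outcomes by group
naive = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

# Adjusted: regress the outcome on treatment AND the confounder
X = np.column_stack([np.ones(n), treatment, confounder])
coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print(f"naive 'effect': {naive:.2f}")          # clearly nonzero -- pure bias
print(f"adjusted effect: {coefs[1]:.2f}")      # close to the true value of zero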

[67] Book: Regression With Social Data: Modeling Continuous and Limited Response Variables. By Alfred DeMaris. John Wiley & Sons, 2004.

Page 9:

Regression modeling of nonexperimental data for the purpose of making causal inferences is ubiquitous in the social sciences. Sample regression coefficients are typically thought of as estimates of the causal impacts of explanatory variables on the outcome. Even though researchers may not acknowledge this explicitly, their use of such language as impact or effect to describe a coefficient value often suggest a causal interpretation. This practice is fraught with controversy….

Page 12:

Friedman … is especially critical of drawing causal inferences from observational data, since all that can be “discovered,” regardless of the statistical candlepower used, is association. Causation has to be assumed into the structure from the beginning. Or, as Friedman … says: “If you want to pull a causal rabbit out of the hat, you have to put the rabbit into the hat.” In my view, this point is well taken; but it does not preclude using regression for causal inference. What it means, instead, is that prior knowledge of the causal status of one’s regressors is a prerequisite for endowing regression coefficients with a causal interpretation, as acknowledged by Pearl 1998.

Page 13:

In sum, causal modeling via regression, using nonexperimental data, can be a useful enterprise provided we bear in mind that several strong assumptions are required to sustain it. First, regardless of the sophistication of our methods, statistical techniques only allow us to examine associations among variables.