Statistical Significance Explained — What p < 0.05 Actually Means

By CalcMulti Editorial Team · 10 min read

Statistical significance is one of the most widely used — and most widely misunderstood — concepts in science, medicine, and data analysis. A result is called "statistically significant" when the probability of observing data at least as extreme as yours, assuming the null hypothesis is true (the p-value), falls below a chosen threshold α (typically 0.05).

But p < 0.05 does not mean the result is important, large, or practically meaningful. It does not mean there is a 95% probability that the effect is real. Understanding what significance does and does not tell you is essential for interpreting research and making good decisions.

What a P-Value Actually Is

The p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one calculated from your data, assuming the null hypothesis is true.

If p = 0.03: assuming H₀ is true, there is a 3% chance of seeing data this extreme by random chance. This seems unlikely enough that we "reject H₀" — but 3% is not 0%. Random coincidences do happen, and with many studies being run, some 3% events will occur.
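The definition above can be made concrete with a small sketch. The measurements below are hypothetical, and the null hypothesis is that the population mean is 100; `scipy.stats.ttest_1samp` returns the test statistic and the two-sided p-value.

```python
# Sketch: a two-sided p-value from a one-sample t-test.
# The sample values are made up for illustration; H0: population mean = 100.
from scipy import stats

sample = [102.1, 99.8, 103.5, 101.2, 100.9, 104.0, 98.7, 102.8]
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    # "Significant" means only: data this extreme is rare if H0 holds.
    print("Reject H0 at alpha = 0.05")
```

Here p comes out near 0.04: unlikely under H₀, but, as the text stresses, not impossible.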

What a p-value is NOT: It is not the probability that H₀ is true. It is not the probability that your result occurred by chance. It is not the probability that your finding will replicate. It is not a measure of effect size.

| p-value | Common interpretation | What it actually means |
| --- | --- | --- |
| < 0.001 | Highly significant | Very rare under H₀, but says nothing about effect size |
| 0.001 – 0.01 | Strongly significant | Rare under H₀ |
| 0.01 – 0.05 | Statistically significant | Below the conventional threshold |
| 0.05 – 0.10 | Marginal / trend | Borderline; often reported cautiously |
| > 0.10 | Not significant | H₀ cannot be rejected at conventional levels |

Why 0.05? The Arbitrary Nature of Alpha

The α = 0.05 threshold was popularised by Ronald Fisher in the 1920s as a convenient round number. It means: "I am willing to incorrectly reject H₀ 5% of the time (when it is actually true)." This is the Type I error rate.

The 0.05 threshold is arbitrary. Physics uses α = 0.0000003 (5 sigma) for major discoveries. Genomics uses α = 5 × 10⁻⁸ to account for hundreds of thousands of simultaneous tests. Preliminary research often uses α = 0.10. The threshold should match the cost of a false positive in your domain.

A major problem: with thousands of studies being published, even if every null hypothesis were true, 5% of studies would falsely "find" a significant result at α = 0.05. This is part of why the replication crisis exists in science — many significant results are false positives.
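This "5% of true nulls come out significant" claim is easy to verify by simulation. The sketch below runs many two-sample t-tests where both groups are drawn from the same distribution, so H₀ is true in every single "study"; the numbers of studies and subjects are arbitrary choices for the illustration.

```python
# Sketch: when H0 is true for every study, roughly alpha of the
# studies still produce p < alpha. All parameters are illustrative.
import random
from scipy import stats

random.seed(42)
n_studies, alpha = 2000, 0.05
false_positives = 0
for _ in range(n_studies):
    # Both groups come from the same N(0, 1) population: no real effect.
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

print(f"False-positive rate: {false_positives / n_studies:.3f}")
```

The observed rate lands near 0.05, exactly as the Type I error rate predicts.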

Statistical Significance vs Practical Significance

Statistical significance tells you whether an observed effect is unlikely to be explained by chance alone. Practical significance tells you whether the effect is large enough to matter. These are completely separate questions.

A drug lowers blood pressure by an average of 1 mmHg (95% CI: 0.5–1.5 mmHg). This is statistically significant (p < 0.001, based on a large trial). But a 1 mmHg reduction is clinically trivial — the drug is practically meaningless despite being "significant."

Conversely: A teaching intervention improves test scores by 15 points on a 100-point scale (p = 0.08). This is not statistically significant at α = 0.05 — but a 15-point improvement would be educationally important. With more students, this effect might reach significance.

Always report effect sizes (Cohen's d, η², R²) alongside p-values to communicate both statistical and practical significance.
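Computing an effect size takes only a few extra lines. The sketch below reports Cohen's d (mean difference divided by the pooled standard deviation) next to the p-value for two hypothetical groups; the scores are invented for illustration.

```python
# Sketch: report Cohen's d alongside the p-value.
# Group scores are hypothetical illustration data.
import math
from scipy import stats

group_a = [72, 75, 78, 71, 74, 77, 73, 76, 79, 70]
group_b = [68, 70, 73, 66, 69, 72, 67, 71, 74, 65]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

def cohens_d(x, y):
    """Mean difference divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd

d = cohens_d(group_a, group_b)
print(f"p = {p_value:.4f}, Cohen's d = {d:.2f}")
```

Here the result is both significant (p < 0.01) and large (d well above Cohen's "large" benchmark of 0.8), so the two kinds of significance happen to agree.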

| Scenario | Statistical significance | Practical significance | Conclusion |
| --- | --- | --- | --- |
| Drug: −1 mmHg blood pressure (n = 10,000, p = 0.001) | Yes (p < 0.05) | No (trivially small) | Significant but useless |
| Teaching: +15 pts test score (n = 20, p = 0.08) | No (p > 0.05) | Yes (large effect) | Possibly important; needs more data |
| Ad: +0.3% click rate (n = 1M, p < 0.001) | Yes | Context-dependent (revenue?) | Report the effect size |
| Therapy: −8 pts depression scale (n = 50, p = 0.02) | Yes | Yes (clinically meaningful) | Good evidence of benefit |

Type I and Type II Errors

Type I error (false positive): Rejecting H₀ when it is actually true. Probability = α (typically 0.05). You conclude there is an effect when there is none.

Type II error (false negative): Failing to reject H₀ when it is actually false. Probability = β (often 0.20). You miss a real effect.

Statistical power = 1 − β = the probability of detecting a real effect when it exists. A power of 0.80 means you have an 80% chance of finding a significant result when the true effect matches your assumption.

The trade-off: reducing α (stricter threshold) reduces Type I errors but increases Type II errors. The only way to reduce both simultaneously is to increase sample size. Power analysis before a study determines the sample size needed to detect a specified effect size with adequate power.
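A minimal power analysis can be sketched with the standard normal approximation for a two-sided two-sample test: power ≈ Φ(d·√(n/2) − z₁₋α/₂), where d is the standardized effect size and n is the per-group sample size. (This approximation ignores the small t-vs-z correction, so real software will give a slightly larger n.)

```python
# Sketch: sample size needed per group for 80% power, using the
# normal approximation to the two-sided two-sample test.
import math
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power to detect standardized effect size d."""
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(d * math.sqrt(n_per_group / 2) - z_crit)

# Medium effect (d = 0.5): smallest n per group reaching 80% power.
n = 2
while power_two_sample(0.5, n) < 0.80:
    n += 1
print(f"n per group = {n}, power = {power_two_sample(0.5, n):.3f}")
```

This lands near the textbook answer of roughly 64 per group for a medium effect at α = 0.05, and it shows the trade-off directly: halving d roughly quadruples the required n.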

| | H₀ true (no effect) | H₀ false (effect exists) |
| --- | --- | --- |
| Reject H₀ (significant) | Type I error (α) | Correct (power = 1 − β) |
| Fail to reject H₀ | Correct (1 − α) | Type II error (β) |

Going Beyond Significance — Better Practices

1. Report effect sizes. Cohen's d for t-tests, η² for ANOVA, R² for regression. These tell you how large the effect is, not just whether it exists.

2. Report confidence intervals. A 95% CI gives the range of plausible effect sizes. "The mean difference is 4.2 (95% CI: 1.8–6.6)" is far more informative than "p = 0.003."

3. Consider power. An underpowered study (n too small) will miss real effects. A result of p = 0.12 from n = 20 does not mean no effect — it may mean insufficient power. Calculate and report your study's power.

4. Pre-register hypotheses. Specifying your hypothesis and analysis plan before collecting data prevents p-hacking (testing many hypotheses until one is significant by chance).

5. Replicate. A single significant result (p < 0.05) is evidence, not proof. The replication crisis has shown that many published findings do not replicate. Significance from a well-powered, pre-registered, replicated study is far more meaningful.
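Practice 2 above is easy to put into code. The sketch below computes a 95% confidence interval for a mean difference from the t distribution; the paired differences are hypothetical illustration data.

```python
# Sketch: 95% CI for a mean difference from the t distribution.
# The paired differences below are invented for illustration.
import math
from scipy import stats

diff = [3.1, 5.0, 4.4, 2.8, 6.1, 3.9, 4.7, 5.5, 3.3, 4.2]
n = len(diff)
mean = sum(diff) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in diff) / (n - 1))
se = sd / math.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value
lo, hi = mean - t_crit * se, mean + t_crit * se

print(f"mean difference = {mean:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```

Reporting the interval, not just "p < 0.05", tells the reader both that the effect is distinguishable from zero and roughly how large it plausibly is.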

Frequently Asked Questions

Educational use only. Content is based on publicly documented mathematical formulas and reviewed for accuracy by the CalcMulti Editorial Team. Last updated: February 2026.