P-Value Explained
By CalcMulti Editorial Team · 9 min read
The p-value is the most widely reported — and most widely misunderstood — number in statistics. It appears in every scientific paper, clinical trial, and A/B test result, yet surveys consistently show that even researchers who use it daily cannot correctly define it.
This guide explains exactly what a p-value is, what "p < 0.05" actually means, what it does NOT mean, and how to use it correctly alongside effect sizes and confidence intervals.
The Exact Definition of a P-Value
A p-value is the probability of obtaining a test statistic as extreme as — or more extreme than — the one observed, assuming the null hypothesis is true.
Worked example: You test whether a coin is fair. You flip it 100 times and get 60 heads. Under H₀ (fair coin, probability of heads = 0.5), what is the probability of getting 60 or more heads? That one-tailed probability is the p-value. If p = 0.028, it means: "If the coin were truly fair, there is only a 2.8% chance of seeing 60+ heads in 100 flips."
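The coin example can be checked directly: under H₀ the number of heads follows a Binomial(100, 0.5) distribution, so the one-tailed p-value is just a sum of binomial probabilities. A minimal stdlib-only sketch (`binom_tail` is our own helper, not a library function):

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-tailed p-value."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability of 60 or more heads in 100 flips of a fair coin
p_value = binom_tail(60, 100)
print(round(p_value, 4))  # ≈ 0.0284 — the 2.8% figure from the example
```

For a two-sided test ("60 or more heads, or 40 or fewer") you would double this tail, since the Binomial(100, 0.5) distribution is symmetric.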
Critically: the p-value is a probability about the data given H₀ — it is not a probability about H₀ given the data. This distinction is the source of most p-value misinterpretations.
What "p < 0.05" Means (and Does Not Mean)
What it means: If the null hypothesis were true, data as extreme as yours would occur less than 5% of the time by chance. You reject H₀ because this is surprising enough under H₀.
What it does NOT mean — the 6 most common misconceptions:
❌ "p < 0.05 means there is a 95% chance the result is real." Wrong. The p-value says nothing about the probability that H₀ is true or false.
❌ "p = 0.04 is a stronger result than p = 0.049." Both are below 0.05. The difference is noise, not signal.
❌ "p > 0.05 means no effect exists." It means insufficient evidence to reject H₀ — absence of evidence is not evidence of absence.
❌ "A small p-value means a large effect." A tiny effect with a huge sample (n = 100,000) can produce p < 0.001. Always report effect size alongside p-value.
❌ "p-value is the probability my hypothesis is wrong." The p-value is computed assuming H₀ is true — it says nothing about the probability of your hypothesis.
❌ "The 0.05 threshold is a law of nature." It is an arbitrary convention established by Ronald Fisher in 1925. Many journals now require p < 0.005 or reporting the exact p-value.
How Is the P-Value Calculated?
The calculation depends on which statistical test you run. The general process is: (1) compute a test statistic from your data, (2) find the probability, under the null distribution, of a test statistic at least as extreme as the one observed.
For a one-sample t-test: t = (x̄ − μ₀) / (s / √n). Then find P(|T| ≥ |t|) where T follows a t-distribution with df = n − 1.
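The t statistic itself needs nothing beyond the sample mean and standard deviation. A sketch with hypothetical readings (the p-value step would then use a t-distribution CDF with df = n − 1, e.g. `scipy.stats.t.sf`, which the stdlib does not provide):

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """t = (x̄ − μ₀) / (s / √n) for a one-sample t-test."""
    n = len(data)
    return (mean(data) - mu0) / (stdev(data) / sqrt(n))

# Hypothetical measurements tested against μ₀ = 5.0
t = one_sample_t([5.1, 4.9, 5.2, 5.0, 4.8, 5.3], 5.0)
print(round(t, 3))  # ≈ 0.655 — nowhere near significant with df = 5
```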
For a chi-square test: χ² = Σ(O − E)²/E. Then find P(χ² ≥ χ²_observed) under the chi-square distribution with the appropriate degrees of freedom.
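The chi-square statistic is a plain sum over categories, and for the special case df = 1 the p-value has a closed form via the normal tail (χ² with 1 degree of freedom is the square of a standard normal). A sketch using the 60-heads coin data as a goodness-of-fit test:

```python
from math import erfc, sqrt

def chi_square(observed, expected):
    """χ² = Σ(O − E)²/E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 60 heads / 40 tails vs. the 50/50 expected under a fair coin
x2 = chi_square([60, 40], [50, 50])
print(x2)  # 4.0

# df = 1: P(χ² ≥ x) = P(|Z| ≥ √x) = erfc(√(x/2))
p = erfc(sqrt(x2 / 2))
print(round(p, 4))  # ≈ 0.0455
```

For higher degrees of freedom the survival function has no such elementary form, which is where a statistics library (or this site's calculator) comes in.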
Modern calculators (including this site's p-value calculator) compute this automatically. The key is choosing the correct test for your data type and research question.
| Test | Data type | Null hypothesis | Test statistic |
|---|---|---|---|
| One-sample t-test | Continuous, one group | Population mean = μ₀ | t = (x̄ − μ₀)/(s/√n) |
| Two-sample t-test | Continuous, two groups | Means are equal | t = (x̄₁ − x̄₂)/SE_diff |
| Chi-square test | Categorical | Variables are independent | χ² = Σ(O−E)²/E |
| Correlation test | Two continuous variables | Correlation ρ = 0 | t = r√(n−2)/√(1−r²) |
| ANOVA F-test | Continuous, 3+ groups | All group means equal | F = MS_between/MS_within |
Significance Levels (α) — The Threshold Choice
The significance level α is the threshold below which you reject H₀. You set α before collecting data — never after seeing the results.
α = 0.05 (5%): the standard in social sciences, biology, and most A/B testing. It means you accept a 5% false positive (Type I error) rate when H₀ is true.
α = 0.01 (1%): stricter, used in medical trials, genomics, and high-stakes decisions. Reduces false positives but increases false negatives (Type II errors).
α = 0.001 (0.1%): very strict, used in particle physics ("5-sigma rule" ≈ p < 0.0000003) and genome-wide association studies.
The choice of α is a decision about how much Type I error risk you are willing to accept. It should reflect the cost of false positives in your field — not the 0.05 tradition.
| α level | Interpretation | Common use |
|---|---|---|
| 0.10 | 10% false positive rate | Exploratory research, pilot studies |
| 0.05 | 5% false positive rate | Social sciences, A/B testing, most clinical research |
| 0.01 | 1% false positive rate | Medical trials, psychology replication |
| 0.001 | 0.1% false positive rate | Genomics, physics, high-stakes decisions |
| < 3×10⁻⁷ | "5-sigma" ≈ 0.0000003 | Particle physics (Higgs boson standard) |
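The sigma-based thresholds in the last row convert to p-values through the normal tail area. A small sketch (one-tailed convention, which is how the ≈ 3×10⁻⁷ figure for 5-sigma is usually quoted):

```python
from math import erfc, sqrt

def sigma_to_p(sigma, two_tailed=False):
    """Normal tail area beyond `sigma` standard deviations."""
    tail = erfc(sigma / sqrt(2)) / 2   # one-tailed: 1 − Φ(sigma)
    return 2 * tail if two_tailed else tail

for s in (2, 3, 5):
    print(f"{s}-sigma: p ≈ {sigma_to_p(s):.2e}")
# 2-sigma is roughly the familiar 0.023 (one-tailed);
# 5-sigma lands near 2.9e-7, the particle-physics standard.
```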
Beyond the P-Value — What Else You Need
A p-value alone is never enough. Modern statistics and major journal guidelines require reporting three things together: the p-value, the effect size, and the confidence interval.
Effect size answers "how large is the effect?" not just "is there an effect?" By Cohen's benchmarks, d ≈ 0.2 is small, 0.5 is medium, and 0.8 is large. A drug that reduces blood pressure by 0.1 mmHg can produce p < 0.001 in a large trial — but 0.1 mmHg is clinically meaningless.
Confidence interval shows the range of plausible values for the true effect. A 95% CI of [0.02, 0.04] is very different from [−0.5, 5.0] even if both have p < 0.05. Wide intervals indicate uncertainty; narrow intervals indicate precision.
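Cohen's d for two groups is the mean difference divided by the pooled standard deviation. A sketch on hypothetical group scores (the data here are made up for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Mean difference in units of the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

treatment = [12, 14, 13, 15, 14, 16]   # hypothetical scores
control   = [11, 13, 12, 12, 14, 13]
print(round(cohens_d(treatment, control), 2))  # 1.2 — large by Cohen's benchmarks
```

Reporting d alongside the p-value answers the "how large?" question that p alone never can.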
The replication crisis in psychology and medicine (2010s onward) was largely caused by over-reliance on p-values without effect sizes or power calculations. Many effects with p < 0.05 failed to replicate because the sample sizes were small and the effects were smaller than originally reported.
P-Hacking and How to Avoid It
P-hacking is the practice of running multiple analyses and reporting only the one with p < 0.05. If you run 20 independent tests at α = 0.05 when every null is true, you expect 1 false positive on average (and the chance of at least one is 1 − 0.95²⁰ ≈ 64%), yet that single "significant" result looks like a real finding.
Common p-hacking patterns: adding more subjects until p < 0.05; trying different outcome variables until one is significant; removing outliers selectively; testing multiple subgroups and reporting only the one that "worked"; switching from two-tailed to one-tailed after seeing results.
Defenses: pre-register your analysis plan before collecting data (AsPredicted.org, OSF). Apply Bonferroni correction or Benjamini-Hochberg correction for multiple comparisons. Report all tests run, not just significant ones. Calculate required sample size in advance using power analysis.
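Both corrections mentioned above are short enough to sketch directly. Bonferroni simply tightens the threshold to α/m; Benjamini–Hochberg is a step-up procedure that controls the false discovery rate and typically rejects more hypotheses (the p-values below are illustrative, not from any real study):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p ≤ α / m (m = number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up: find the largest rank k with p_(k) ≤ (k/m)·α,
    then reject every hypothesis with p-value ≤ p_(k)."""
    m = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda t: t[1])
    cutoff = -1.0
    for rank, (_, p) in enumerate(indexed, start=1):
        if p <= rank / m * alpha:
            cutoff = p
    return [p <= cutoff for p in p_values]

ps = [0.010, 0.013, 0.014, 0.190, 0.350]
print(bonferroni(ps))          # [True, False, False, False, False]
print(benjamini_hochberg(ps))  # [True, True, True, False, False]
```

On the same five p-values, Bonferroni keeps one discovery while BH keeps three: the usual trade-off between strict family-wise error control and the less conservative false-discovery-rate control.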
Related Calculators
P-Value Calculator: Calculate p-values from z, t, or chi-square statistics
Hypothesis Testing Basics: The 5-step hypothesis testing process
T-Test Calculator: One- and two-sample t-tests
Effect Size Calculator: Cohen's d and other effect size measures
Type I vs Type II Error: False positives and false negatives explained
Statistics Hub: All statistics calculators and guides
Educational use only. Content is based on publicly documented mathematical formulas and reviewed for accuracy by the CalcMulti Editorial Team. Last updated: March 2026.