Hypothesis Testing Basics
By CalcMulti Editorial Team · 9 min read
Hypothesis testing is the formal procedure for deciding whether data provides enough evidence to reject a claim about a population. It is the engine behind A/B testing, clinical trials, quality control, and virtually all scientific research.
This guide walks through the 5-step process, explains the logic of null and alternative hypotheses, demystifies Type I and Type II errors, and shows you how to choose the right test for your situation.
The Core Logic of Hypothesis Testing
Hypothesis testing uses a proof-by-contradiction approach: assume the effect you are looking for does not exist (the null hypothesis H₀), then ask "how likely is it that I would see data this extreme if H₀ were true?" If that probability is very small (below your chosen significance threshold α), you have evidence to reject H₀.
Key analogy: hypothesis testing works like a legal trial. H₀ is "innocent" — you start by assuming no effect. The data is the evidence. The p-value measures how surprising the evidence is if H₀ were true. Rejecting H₀ means the evidence is strong enough to conclude "guilty" — but you could still be wrong (Type I error), just as convictions can be wrong.
What hypothesis testing does NOT do: it does not prove H₀ is true when you fail to reject it. Failing to reject H₀ means "insufficient evidence against it," not "H₀ is proven correct." It also does not measure how large or practically important an effect is — a tiny effect with a huge sample can produce p < 0.05 while being meaningless in practice.
The 5-Step Process
Step 1: State the hypotheses. H₀ (null hypothesis): the status quo or "no effect" claim. H₁ (alternative hypothesis): the effect you want to detect. Example: "A new drug reduces blood pressure" → H₀: μ_treatment = μ_control (no difference); H₁: μ_treatment < μ_control (drug reduces BP). Always state both hypotheses before collecting data.
Step 2: Choose significance level α. α is the probability of rejecting H₀ when it is actually true (Type I error rate). Standard choice: α = 0.05 (5%). Stricter choices: α = 0.01 (1%) for medical studies or α = 0.001 (0.1%) for high-stakes decisions. Also decide: one-tailed (directional test) or two-tailed (any difference).
Step 3: Collect data and compute the test statistic. The test statistic measures how far your sample result is from what H₀ predicts, in standard error units. For a one-sample mean test: t = (x̄ − μ₀) / (s / √n). For a proportion test: z = (p̂ − p₀) / √(p₀(1−p₀)/n). For categorical data: χ² = Σ(O−E)²/E.
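The Step 3 formulas translate directly into code. A minimal sketch in Python (the sample numbers below are invented for illustration):

```python
import math

def t_statistic(sample_mean, mu0, sample_sd, n):
    """One-sample t statistic: t = (x-bar − μ₀) / (s / √n)."""
    return (sample_mean - mu0) / (sample_sd / math.sqrt(n))

def z_statistic_proportion(p_hat, p0, n):
    """One-proportion z statistic: z = (p-hat − p₀) / √(p₀(1 − p₀)/n)."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Hypothetical examples:
t = t_statistic(10.5, 10.0, 2.0, 36)          # (0.5) / (2/6) = 1.5
z = z_statistic_proportion(0.55, 0.50, 400)   # 0.05 / 0.025 = 2.0
print(round(t, 3), round(z, 3))
```

Both functions return the number of standard errors separating the observed result from what H₀ predicts, which is then compared against the reference distribution in Step 4.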
Step 4: Find the p-value. The p-value is P(test statistic at least this extreme | H₀ is true). Look it up from the t-distribution, z-distribution, or chi-square distribution using your test statistic and degrees of freedom. Modern software and online calculators do this automatically.
Step 5: Make a decision and interpret. If p ≤ α: reject H₀ — the result is statistically significant. If p > α: fail to reject H₀ — insufficient evidence. Always report the exact p-value, effect size, and confidence interval — not just "significant/not significant."
Type I and Type II Errors
Because hypothesis tests are based on probabilities, they can lead to two types of wrong decisions. Understanding these errors is essential for choosing the right α and sample size.
| | H₀ is actually TRUE | H₀ is actually FALSE |
|---|---|---|
| You reject H₀ | Type I error (False Positive) — rate = α | Correct decision — rate = 1 − β (Power) |
| You fail to reject H₀ | Correct decision — rate = 1 − α | Type II error (False Negative) — rate = β |
Understanding Type I and Type II Errors
Type I error (false positive, rate = α): Rejecting H₀ when it is actually true. Example: concluding a drug works when it is actually no better than placebo. The significance level α directly controls this error rate — setting α = 0.05 means you accept a 5% chance of a false positive. Reduce Type I errors by lowering α, but this increases Type II errors.
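The claim that α directly sets the false-positive rate can be checked by simulation: generate data where H₀ is really true, test it repeatedly, and count how often you reject. A stdlib-only sketch using a z-test with known σ = 1 (all numbers here are illustrative):

```python
import random
import statistics

random.seed(42)  # reproducible run
alpha = 0.05
z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96

n, trials, rejections = 30, 2000, 0
for _ in range(trials):
    # H₀ is true by construction: data really comes from N(0, 1)
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.fmean(sample) * n ** 0.5  # z = x-bar / (σ/√n), with σ = 1
    if abs(z) > z_crit:
        rejections += 1  # every rejection here is a Type I error

print(rejections / trials)  # close to alpha = 0.05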
Type II error (false negative, rate = β): Failing to reject H₀ when the alternative is actually true. Example: concluding a drug does not work when it actually does. The rate β is controlled by statistical power (= 1 − β). High power means a low chance of missing a real effect. Increase power by: increasing sample size, reducing measurement variability, or raising α; power is also higher when the true effect size is larger, though that is a property of the phenomenon rather than a design choice.
The tradeoff: Lowering α reduces Type I errors but increases Type II errors (harder to detect real effects). The standard approach: set α = 0.05, then design the study with sufficient sample size to achieve 80% power (β = 0.20) — meaning an 80% chance of detecting the effect if it truly exists. A power analysis before data collection tells you the minimum sample size needed.
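A power analysis for a two-group comparison can be sketched with the standard normal approximation: n per group ≈ 2(z₁₋α/₂ + z₁₋β)² / d², where d is the standardized effect size (Cohen's d). This is an approximation; dedicated tools refine it slightly using the t distribution. A stdlib-only sketch:

```python
import math
from statistics import NormalDist

def sample_size_two_groups(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sample test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.960 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.842 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2
    return math.ceil(n)

print(sample_size_two_groups(0.5))  # medium effect (d = 0.5): ≈ 63 per group
```

Note how the sample size scales with 1/d²: halving the effect size you want to detect quadruples the required n per group.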
Choosing the Right Statistical Test
The correct test depends on: (1) the type of outcome variable (continuous vs categorical), (2) the number of groups being compared, (3) whether observations are independent or paired, and (4) whether normality assumptions hold.
| Question | Data type | Test to use |
|---|---|---|
| Is mean different from a known value? | Continuous (1 group) | One-sample t-test |
| Are two independent group means different? | Continuous (2 groups) | Independent two-sample t-test (Welch) |
| Are before/after measurements different? | Continuous (paired) | Paired t-test |
| Are 3+ group means different? | Continuous (3+ groups) | One-way ANOVA |
| Is a proportion different from a known value? | Binary (1 proportion) | One-proportion z-test or binomial test |
| Are two proportions different? | Binary (2 groups) | Two-proportion z-test or chi-square |
| Is there an association between two categorical variables? | Categorical (2 variables) | Chi-square test of independence |
| Does the distribution match expected frequencies? | Categorical (counts) | Chi-square goodness-of-fit test |
| Non-normal continuous data, 2 groups | Continuous (non-normal) | Mann-Whitney U test (non-parametric) |
| Non-normal continuous data, 3+ groups | Continuous (non-normal) | Kruskal-Wallis test (non-parametric) |
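In practice, each row of the table maps to a library routine. A sketch using SciPy's `scipy.stats` module (all data arrays below are invented for illustration):

```python
from scipy import stats

before = [12.1, 11.8, 12.4, 12.9, 11.5, 12.2]  # hypothetical paired data
after = [11.6, 11.4, 12.0, 12.1, 11.3, 11.9]
group_a = [5.1, 4.8, 5.5, 5.0, 4.9]            # hypothetical independent groups
group_b = [5.6, 5.9, 5.4, 6.1, 5.8]

t_paired = stats.ttest_rel(before, after)                      # paired t-test
t_welch = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test
u_test = stats.mannwhitneyu(group_a, group_b)                  # non-parametric
chi2 = stats.chisquare([18, 22, 20, 40], f_exp=[25, 25, 25, 25])  # goodness of fit

for name, res in [("paired t", t_paired), ("Welch t", t_welch),
                  ("Mann-Whitney U", u_test), ("chi-square GoF", chi2)]:
    print(f"{name}: statistic={res.statistic:.3f}, p={res.pvalue:.4f}")
```

Each result object carries both the test statistic and the p-value, so Steps 3 and 4 collapse into a single call; the decision in Step 5 is still yours to make against your pre-chosen α.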
Worked Example: Full 5-Step Hypothesis Test
Scenario: A coffee shop claims their "large" coffee contains 500ml. Quality control samples 25 large coffees and finds mean = 492ml, SD = 18ml. Is there evidence the fills are below spec?
Step 1 — Hypotheses: H₀: μ = 500ml (fills are on spec). H₁: μ < 500ml (fills are below spec). One-tailed test because we specifically care about under-filling.
Step 2 — Significance level: α = 0.05 (standard business decision threshold).
Step 3 — Test statistic: t = (492 − 500) / (18 / √25) = −8 / (18/5) = −8 / 3.6 = −2.22. df = 25 − 1 = 24.
Step 4 — P-value: P(T < −2.22 | df = 24) ≈ 0.018 (from t-distribution, one-tailed).
Step 5 — Decision: p = 0.018 < α = 0.05 → Reject H₀. There is statistically significant evidence that the mean fill is below 500ml. The one-sided 95% upper confidence bound: μ < 492 + 1.711 × 3.6 ≈ 498.2ml, which sits below the 500ml spec and is consistent with rejecting H₀. The QC manager investigates the filling machine.
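The worked example can be reproduced without lookup tables by integrating the Student's t density numerically. This is a stdlib-only sketch for illustration; in practice you would use a statistics library or calculator for Step 4:

```python
import math

def t_pdf(x, df):
    """Student's t probability density with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_tail(t, df, upper=60.0, steps=20000):
    """P(T > t) by trapezoidal integration of the density (approximation)."""
    h = (upper - t) / steps
    total = 0.5 * (t_pdf(t, df) + t_pdf(upper, df))
    total += sum(t_pdf(t + i * h, df) for i in range(1, steps))
    return total * h

# Coffee example: n = 25, sample mean = 492, s = 18, μ₀ = 500
t_stat = (492 - 500) / (18 / math.sqrt(25))  # ≈ -2.222
p_value = t_tail(-t_stat, df=24)             # one-tailed, by symmetry
print(round(t_stat, 3), round(p_value, 4))   # p near 0.018
```

The computed one-tailed p-value matches the ≈ 0.018 quoted in Step 4, confirming the decision to reject H₀ at α = 0.05.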
Related Calculators
- Test if two means differ significantly
- P-Value Calculator: Compute p-value from any test statistic
- Chi-Square Calculator: Test categorical data distributions
- Normal Distribution Explained: Distribution underlying most tests
- Sample Size Calculator: Ensure adequate power before testing
- Statistics Hub: All statistics calculators & guides
Educational use only. Content is based on publicly documented mathematical formulas and reviewed for accuracy by the CalcMulti Editorial Team. Last updated: February 2026.