Hypothesis Testing Basics

By CalcMulti Editorial Team · 9 min read

Hypothesis testing is the formal procedure for deciding whether data provides enough evidence to reject a claim about a population. It is the engine behind A/B testing, clinical trials, quality control, and virtually all scientific research.

This guide walks through the 5-step process, explains the logic of null and alternative hypotheses, demystifies Type I and Type II errors, and shows you how to choose the right test for your situation.

The Core Logic of Hypothesis Testing

Hypothesis testing uses a proof-by-contradiction approach: assume the effect you are looking for does not exist (the null hypothesis H₀), then ask "how likely is it that I would see data this extreme if H₀ were true?" If that probability is very small (below your chosen significance threshold α), you have evidence to reject H₀.

Key analogy: hypothesis testing works like a legal trial. H₀ is "innocent" — you start by assuming no effect. The data is the evidence. The p-value measures how surprising the evidence is if H₀ were true. Rejecting H₀ means the evidence is strong enough to conclude "guilty" — but you could still be wrong (Type I error), just as convictions can be wrong.

What hypothesis testing does NOT do: it does not prove H₀ is true when you fail to reject it. Failing to reject H₀ means "insufficient evidence against it," not "H₀ is proven correct." It also does not measure how large or practically important an effect is — a tiny effect with a huge sample can produce p < 0.05 while being meaningless in practice.
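The "huge sample, tiny effect" caveat is easy to demonstrate numerically. A minimal sketch using Python's standard library, with made-up numbers: an effect of just 0.01 standard deviations (negligible in practice) still yields an enormous z statistic once the sample is large enough.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers: a true effect of 0.01 standard deviations,
# meaningless in practice, tested with an enormous sample.
effect_sd = 0.01   # standardized effect size (assumed)
n = 1_000_000      # sample size (assumed)

# z statistic for a one-sample z-test of this effect
z = effect_sd * sqrt(n)                    # = 10.0
p_two_tailed = 2 * NormalDist().cdf(-abs(z))

print(f"z = {z:.1f}, p = {p_two_tailed:.2e}")  # p is astronomically small
```

Statistical significance here says nothing about practical importance, which is why Step 5 below insists on reporting effect sizes alongside p-values.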

The 5-Step Process

Step 1: State the hypotheses. H₀ (null hypothesis): the status quo or "no effect" claim. H₁ (alternative hypothesis): the effect you want to detect. Example: "A new drug reduces blood pressure" → H₀: μ_treatment = μ_control (no difference); H₁: μ_treatment < μ_control (drug reduces BP). Always state both hypotheses before collecting data.

Step 2: Choose significance level α. α is the probability of rejecting H₀ when it is actually true (Type I error rate). Standard choice: α = 0.05 (5%). Stricter choices: α = 0.01 (1%) for medical studies or α = 0.001 (0.1%) for high-stakes decisions. Also decide: one-tailed (directional test) or two-tailed (any difference).

Step 3: Collect data and compute the test statistic. The test statistic measures how far your sample result is from what H₀ predicts, in standard error units. For a one-sample mean test: t = (x̄ − μ₀) / (s / √n). For a proportion test: z = (p̂ − p₀) / √(p₀(1−p₀)/n). For categorical data: χ² = Σ(O−E)²/E.
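The three formulas above translate directly into code. A short sketch with illustrative numbers (the inputs are assumed, not from a real study):

```python
from math import sqrt

# One-sample mean test: t = (x̄ − μ₀) / (s / √n)
def t_statistic(xbar, mu0, s, n):
    return (xbar - mu0) / (s / sqrt(n))

# One-proportion test: z = (p̂ − p₀) / √(p₀(1−p₀)/n)
def z_statistic(p_hat, p0, n):
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Chi-square statistic: χ² = Σ (O − E)² / E over all categories
def chi_square_statistic(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(t_statistic(492, 500, 18, 25))             # ≈ -2.22
print(z_statistic(0.55, 0.50, 400))              # = 2.0
print(chi_square_statistic([48, 52], [50, 50]))  # = 0.16
```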

Step 4: Find the p-value. The p-value is P(test statistic this extreme | H₀ is true). Look it up from the t-distribution, z-distribution, or chi-square distribution using your test statistic and degrees of freedom. Modern software and online calculators do this automatically.
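For a z statistic, the lookup needs nothing beyond Python's standard library. A minimal sketch (the t and chi-square distributions are not in the stdlib; libraries such as scipy provide them via `scipy.stats.t` and `scipy.stats.chi2`):

```python
from statistics import NormalDist

def z_p_value(z, tails=2):
    """P-value for a z statistic: P(|Z| >= |z|) two-tailed,
    or P(Z <= z) for a one-tailed (lower) test."""
    if tails == 2:
        return 2 * NormalDist().cdf(-abs(z))
    return NormalDist().cdf(z)

print(z_p_value(2.0))            # two-tailed ≈ 0.0455
print(z_p_value(-2.0, tails=1))  # one-tailed ≈ 0.0228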

Step 5: Make a decision and interpret. If p ≤ α: reject H₀ — the result is statistically significant. If p > α: fail to reject H₀ — insufficient evidence. Always report the exact p-value, effect size, and confidence interval — not just "significant/not significant."

Type I and Type II Errors

Because hypothesis tests are based on probabilities, they can lead to two types of wrong decisions. Understanding these errors is essential for choosing the right α and sample size.

|                       | H₀ is actually TRUE                      | H₀ is actually FALSE                      |
|-----------------------|------------------------------------------|-------------------------------------------|
| You reject H₀         | Type I error (False Positive) — rate = α | Correct decision — rate = 1 − β (Power)   |
| You fail to reject H₀ | Correct decision — rate = 1 − α          | Type II error (False Negative) — rate = β |

Understanding Type I and Type II Errors

Type I error (false positive, rate = α): Rejecting H₀ when it is actually true. Example: concluding a drug works when it is actually no better than placebo. The significance level α directly controls this error rate — setting α = 0.05 means you accept a 5% chance of a false positive. Reduce Type I errors by lowering α, but this increases Type II errors.
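The claim that α directly controls the false-positive rate can be checked with a small simulation. A sketch under one simplifying assumption: when H₀ is true, the z statistic follows a standard normal distribution, so we can draw it directly and count rejections.

```python
import random

random.seed(42)      # fixed seed so the run is reproducible
alpha = 0.05
z_crit = 1.96        # two-tailed critical value for α = 0.05
trials = 20_000

# Under H₀ the z statistic is standard normal; every rejection
# counted here is by construction a false positive.
false_positives = sum(abs(random.gauss(0, 1)) > z_crit for _ in range(trials))
rate = false_positives / trials

print(f"Observed Type I error rate: {rate:.3f}")  # close to α = 0.05
```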

Type II error (false negative, rate = β): Failing to reject H₀ when the alternative is actually true. Example: concluding a drug does not work when it actually does. The rate β is controlled by statistical power (= 1 − β). High power means a low chance of missing a real effect. Increase power by increasing the sample size, reducing measurement variability, targeting a larger effect, or raising α.

The tradeoff: Lowering α reduces Type I errors but increases Type II errors (harder to detect real effects). The standard approach: set α = 0.05, then design the study with sufficient sample size to achieve 80% power (β = 0.20) — meaning an 80% chance of detecting the effect if it truly exists. A power analysis before data collection tells you the minimum sample size needed.
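The power analysis mentioned above can be sketched with the standard normal approximation for a two-sample comparison of means (the exact t-based calculation adds a few more subjects; the effect size d = 0.5 below is an assumed illustrative value):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n per group for comparing two means,
    using the normal approximation: n = 2((z_α/2 + z_power) / d)²."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for α = 0.05
    z_power = NormalDist().inv_cdf(power)          # ≈ 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)

# A medium standardized effect (d = 0.5) at α = 0.05, 80% power:
print(sample_size_per_group(0.5))  # 63 per group
```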

Choosing the Right Statistical Test

The correct test depends on: (1) the type of outcome variable (continuous vs categorical), (2) the number of groups being compared, (3) whether observations are independent or paired, and (4) whether normality assumptions hold.

| Question | Data type | Test to use |
|---|---|---|
| Is the mean different from a known value? | Continuous (1 group) | One-sample t-test |
| Are two independent group means different? | Continuous (2 groups) | Independent two-sample t-test (Welch) |
| Are before/after measurements different? | Continuous (paired) | Paired t-test |
| Are 3+ group means different? | Continuous (3+ groups) | One-way ANOVA |
| Is a proportion different from a known value? | Binary (1 proportion) | One-proportion z-test or binomial test |
| Are two proportions different? | Binary (2 groups) | Two-proportion z-test or chi-square |
| Is there an association between two categorical variables? | Categorical (2 variables) | Chi-square test of independence |
| Does the distribution match expected frequencies? | Categorical (counts) | Chi-square goodness-of-fit test |
| Non-normal continuous data, 2 groups | Continuous (non-normal) | Mann-Whitney U test (non-parametric) |
| Non-normal continuous data, 3+ groups | Continuous (non-normal) | Kruskal-Wallis test (non-parametric) |
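The decision logic in the table can be mirrored in a small helper. This is a hypothetical illustration (the function name and parameters are invented for this sketch, and it is no substitute for checking the test's assumptions):

```python
# Hypothetical helper mirroring the table above: map the situation
# to a suggested test name. Illustration only.
def suggest_test(outcome, groups, paired=False, normal=True):
    if outcome == "continuous":
        if not normal:
            return "Mann-Whitney U test" if groups == 2 else "Kruskal-Wallis test"
        if groups == 1:
            return "One-sample t-test"
        if groups == 2:
            return "Paired t-test" if paired else "Welch two-sample t-test"
        return "One-way ANOVA"
    if outcome == "binary":
        return "One-proportion z-test" if groups == 1 else "Two-proportion z-test"
    return "Chi-square test"

print(suggest_test("continuous", 2, paired=True))  # Paired t-test
print(suggest_test("binary", 2))                   # Two-proportion z-test
```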

Worked Example: Full 5-Step Hypothesis Test

Scenario: A coffee shop claims their "large" coffee contains 500ml. Quality control samples 25 large coffees and finds mean = 492ml, SD = 18ml. Is there evidence the fills are below spec?

Step 1 — Hypotheses: H₀: μ = 500ml (fills are on spec). H₁: μ < 500ml (fills are below spec). One-tailed test because we specifically care about under-filling.

Step 2 — Significance level: α = 0.05 (standard business decision threshold).

Step 3 — Test statistic: t = (492 − 500) / (18 / √25) = −8 / (18/5) = −8 / 3.6 = −2.22. df = 25 − 1 = 24.

Step 4 — P-value: P(T < −2.22 | df = 24) ≈ 0.018 (from t-distribution, one-tailed).

Step 5 — Decision: p = 0.018 < α = 0.05 → Reject H₀. There is statistically significant evidence that the mean fill is below 500ml. The 95% one-sided CI: μ < 492 + 1.711 × 3.6 = 498.2ml. The QC manager investigates the filling machine.
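The arithmetic in Steps 3-5 can be reproduced in a few lines of Python. A sketch using only the standard library; the critical value t₀.₉₅ at df = 24 is taken from a t table (computing the exact p ≈ 0.018 requires the t distribution, e.g. `scipy.stats.t.cdf`):

```python
from math import sqrt

# Coffee-fill data from the worked example
n, xbar, s, mu0 = 25, 492, 18, 500

se = s / sqrt(n)                  # standard error = 3.6
t = (xbar - mu0) / se             # ≈ -2.22
t_crit = 1.711                    # t₀.₉₅ with df = 24 (t table)
upper_bound = xbar + t_crit * se  # one-sided 95% bound ≈ 498.2 ml

print(f"t = {t:.2f}, 95% upper bound = {upper_bound:.1f} ml")
```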


Educational use only. Content is based on publicly documented mathematical formulas and reviewed for accuracy by the CalcMulti Editorial Team. Last updated: February 2026.