Hypothesis Testing Basics
By CalcMulti Editorial Team · 9 min read
Hypothesis testing is the formal procedure for deciding whether data provides enough evidence to reject a claim about a population. It is the engine behind A/B testing, clinical trials, quality control, and virtually all scientific research.
This guide walks through the 5-step process, explains the logic of null and alternative hypotheses, demystifies Type I and Type II errors, and shows you how to choose the right test for your situation.
The Core Logic of Hypothesis Testing
Hypothesis testing uses a proof-by-contradiction approach: assume the effect you are looking for does not exist (the null hypothesis H₀), then ask "how likely is it that I would see data this extreme if H₀ were true?" If that probability is very small (below your chosen significance threshold α), you have evidence to reject H₀.
Key analogy: hypothesis testing works like a legal trial. H₀ is "innocent" — you start by assuming no effect. The data is the evidence. The p-value measures how surprising the evidence is if H₀ were true. Rejecting H₀ means the evidence is strong enough to conclude "guilty" — but you could still be wrong (Type I error), just as convictions can be wrong.
What hypothesis testing does NOT do: it does not prove H₀ is true when you fail to reject it. Failing to reject H₀ means "insufficient evidence against it," not "H₀ is proven correct." It also does not measure how large or practically important an effect is — a tiny effect with a huge sample can produce p < 0.05 while being meaningless in practice.
The 5-Step Process
Step 1: State the hypotheses. H₀ (null hypothesis): the status quo or "no effect" claim. H₁ (alternative hypothesis): the effect you want to detect. Example: "A new drug reduces blood pressure" → H₀: μ_treatment = μ_control (no difference); H₁: μ_treatment < μ_control (drug reduces BP). Always state both hypotheses before collecting data.
Step 2: Choose significance level α. α is the probability of rejecting H₀ when it is actually true (Type I error rate). Standard choice: α = 0.05 (5%). Stricter choices: α = 0.01 (1%) for medical studies or α = 0.001 (0.1%) for high-stakes decisions. Also decide: one-tailed (directional test) or two-tailed (any difference).
Step 3: Collect data and compute the test statistic. The test statistic measures how far your sample result is from what H₀ predicts, in standard error units. For a one-sample mean test: t = (x̄ − μ₀) / (s / √n). For a proportion test: z = (p̂ − p₀) / √(p₀(1−p₀)/n). For categorical data: χ² = Σ(O−E)²/E.
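The Step 3 formulas translate directly into code. A minimal sketch in Python (the sample numbers below are invented for illustration):

```python
import math

def t_statistic(sample_mean, mu0, sample_sd, n):
    """One-sample t statistic: t = (x-bar − μ₀) / (s / √n)."""
    return (sample_mean - mu0) / (sample_sd / math.sqrt(n))

def z_statistic_proportion(p_hat, p0, n):
    """One-proportion z statistic: z = (p-hat − p₀) / √(p₀(1 − p₀)/n)."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Hypothetical examples:
t = t_statistic(10.5, 10.0, 2.0, 36)          # (0.5) / (2/6) = 1.5
z = z_statistic_proportion(0.55, 0.50, 400)   # 0.05 / 0.025 = 2.0
print(round(t, 3), round(z, 3))
```

Both functions return the number of standard errors separating the observed result from what H₀ predicts, which is then compared against the reference distribution in Step 4.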
Step 4: Find the p-value. The p-value is P(test statistic at least this extreme | H₀ is true). Look it up from the t-distribution, z-distribution, or chi-square distribution using your test statistic and degrees of freedom. Modern software and online calculators do this automatically.
Step 5: Make a decision and interpret. If p ≤ α: reject H₀ — the result is statistically significant. If p > α: fail to reject H₀ — insufficient evidence. Always report the exact p-value, effect size, and confidence interval — not just "significant/not significant."
Type I and Type II Errors
Because hypothesis tests are based on probabilities, they can lead to two types of wrong decisions. Understanding these errors is essential for choosing the right α and sample size.
| | H₀ is actually TRUE | H₀ is actually FALSE |
|---|---|---|
| You reject H₀ | Type I error (False Positive) — rate = α | Correct decision — rate = 1 − β (Power) |
| You fail to reject H₀ | Correct decision — rate = 1 − α | Type II error (False Negative) — rate = β |
Understanding Type I and Type II Errors
Type I error (false positive, rate = α): Rejecting H₀ when it is actually true. Example: concluding a drug works when it is actually no better than placebo. The significance level α directly controls this error rate — setting α = 0.05 means you accept a 5% chance of a false positive. Reduce Type I errors by lowering α, but this increases Type II errors.
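The claim that α directly sets the false-positive rate can be checked by simulation: generate data where H₀ is really true, test it repeatedly, and count how often you reject. A stdlib-only sketch using a z-test with known σ = 1 (all numbers here are illustrative):

```python
import random
import statistics

random.seed(42)  # reproducible run
alpha = 0.05
z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96

n, trials, rejections = 30, 2000, 0
for _ in range(trials):
    # H₀ is true by construction: data really comes from N(0, 1)
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.fmean(sample) * n ** 0.5  # z = x-bar / (σ/√n), with σ = 1
    if abs(z) > z_crit:
        rejections += 1  # every rejection here is a Type I error

print(rejections / trials)  # close to alpha = 0.05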
Type II error (false negative, rate = β): Failing to reject H₀ when the alternative is actually true. Example: concluding a drug does not work when it actually does. The rate β is controlled by statistical power (= 1 − β). High power means a low chance of missing a real effect. Increase power by: increasing sample size, reducing measurement variability, or raising α; power is also higher when the true effect size is larger, though that is a property of the phenomenon rather than a design choice.
The tradeoff: Lowering α reduces Type I errors but increases Type II errors (harder to detect real effects). The standard approach: set α = 0.05, then design the study with sufficient sample size to achieve 80% power (β = 0.20) — meaning an 80% chance of detecting the effect if it truly exists. A power analysis before data collection tells you the minimum sample size needed.
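A power analysis for a two-group comparison can be sketched with the standard normal approximation: n per group ≈ 2(z₁₋α/₂ + z₁₋β)² / d², where d is the standardized effect size (Cohen's d). This is an approximation; dedicated tools refine it slightly using the t distribution. A stdlib-only sketch:

```python
import math
from statistics import NormalDist

def sample_size_two_groups(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sample test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.960 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.842 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2
    return math.ceil(n)

print(sample_size_two_groups(0.5))  # medium effect (d = 0.5): ≈ 63 per group
```

Note how the sample size scales with 1/d²: halving the effect size you want to detect quadruples the required n per group.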
Choosing the Right Statistical Test
The correct test depends on: (1) the type of outcome variable (continuous vs categorical), (2) the number of groups being compared, (3) whether observations are independent or paired, and (4) whether normality assumptions hold.
| Question | Data type | Test to use |
|---|---|---|
| Is mean different from a known value? | Continuous (1 group) | One-sample t-test |
| Are two independent group means different? | Continuous (2 groups) | Independent two-sample t-test (Welch) |
| Are before/after measurements different? | Continuous (paired) | Paired t-test |
| Are 3+ group means different? | Continuous (3+ groups) | One-way ANOVA |
| Is a proportion different from a known value? | Binary (1 proportion) | One-proportion z-test or binomial test |
| Are two proportions different? | Binary (2 groups) | Two-proportion z-test or chi-square |
| Is there an association between two categorical variables? | Categorical (2 variables) | Chi-square test of independence |
| Does the distribution match expected frequencies? | Categorical (counts) | Chi-square goodness-of-fit test |
| Non-normal continuous data, 2 groups | Continuous (non-normal) | Mann-Whitney U test (non-parametric) |
| Non-normal continuous data, 3+ groups | Continuous (non-normal) | Kruskal-Wallis test (non-parametric) |
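In practice, each row of the table maps to a library routine. A sketch using SciPy's `scipy.stats` module (all data arrays below are invented for illustration):

```python
from scipy import stats

before = [12.1, 11.8, 12.4, 12.9, 11.5, 12.2]  # hypothetical paired data
after = [11.6, 11.4, 12.0, 12.1, 11.3, 11.9]
group_a = [5.1, 4.8, 5.5, 5.0, 4.9]            # hypothetical independent groups
group_b = [5.6, 5.9, 5.4, 6.1, 5.8]

t_paired = stats.ttest_rel(before, after)                      # paired t-test
t_welch = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test
u_test = stats.mannwhitneyu(group_a, group_b)                  # non-parametric
chi2 = stats.chisquare([18, 22, 20, 40], f_exp=[25, 25, 25, 25])  # goodness of fit

for name, res in [("paired t", t_paired), ("Welch t", t_welch),
                  ("Mann-Whitney U", u_test), ("chi-square GoF", chi2)]:
    print(f"{name}: statistic={res.statistic:.3f}, p={res.pvalue:.4f}")
```

Each result object carries both the test statistic and the p-value, so Steps 3 and 4 collapse into a single call; the decision in Step 5 is still yours to make against your pre-chosen α.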
Worked Example: Full 5-Step Hypothesis Test
Scenario: A coffee shop claims their "large" coffee contains 500ml. Quality control samples 25 large coffees and finds mean = 492ml, SD = 18ml. Is there evidence the fills are below spec?
Step 1 — Hypotheses: H₀: μ = 500ml (fills are on spec). H₁: μ < 500ml (fills are below spec). One-tailed test because we specifically care about under-filling.
Step 2 — Significance level: α = 0.05 (standard business decision threshold).
Step 3 — Test statistic: t = (492 − 500) / (18 / √25) = −8 / (18/5) = −8 / 3.6 = −2.22. df = 25 − 1 = 24.
Step 4 — P-value: P(T < −2.22 | df = 24) ≈ 0.018 (from t-distribution, one-tailed).
Step 5 — Decision: p = 0.018 < α = 0.05 → Reject H₀. There is statistically significant evidence that the mean fill is below 500ml. The one-sided 95% upper confidence bound: μ < 492 + 1.711 × 3.6 ≈ 498.2ml, which sits below the 500ml spec and is consistent with rejecting H₀. The QC manager investigates the filling machine.
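The worked example can be reproduced without lookup tables by integrating the Student's t density numerically. This is a stdlib-only sketch for illustration; in practice you would use a statistics library or calculator for Step 4:

```python
import math

def t_pdf(x, df):
    """Student's t probability density with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_tail(t, df, upper=60.0, steps=20000):
    """P(T > t) by trapezoidal integration of the density (approximation)."""
    h = (upper - t) / steps
    total = 0.5 * (t_pdf(t, df) + t_pdf(upper, df))
    total += sum(t_pdf(t + i * h, df) for i in range(1, steps))
    return total * h

# Coffee example: n = 25, sample mean = 492, s = 18, μ₀ = 500
t_stat = (492 - 500) / (18 / math.sqrt(25))  # ≈ -2.222
p_value = t_tail(-t_stat, df=24)             # one-tailed, by symmetry
print(round(t_stat, 3), round(p_value, 4))   # p near 0.018
```

The computed one-tailed p-value matches the ≈ 0.018 quoted in Step 4, confirming the decision to reject H₀ at α = 0.05.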
Related Calculators
- Test if two means differ significantly
- P-Value Calculator: Compute p-value from any test statistic
- Chi-Square Calculator: Test categorical data distributions
- Normal Distribution Explained: Distribution underlying most tests
- Sample Size Calculator: Ensure adequate power before testing
- Statistics Hub: All statistics calculators & guides
Educational use only. Content is based on publicly documented mathematical formulas and reviewed for accuracy by the CalcMulti Editorial Team. Last updated: February 2026.