Regression Analysis Explained — How Linear Regression Works
By CalcMulti Editorial Team · 10 min read
Linear regression models the relationship between an outcome variable (y) and one or more predictor variables (x). Simple linear regression uses a single predictor; multiple linear regression uses two or more. The goal: find the straight line y = b₀ + b₁x that best describes the data and allows prediction.
Regression is one of the most widely used statistical methods — for predicting sales from advertising spend, estimating salary from years of experience, or modelling the effect of a drug dose on recovery time. Understanding how regression works helps you interpret results correctly and avoid common pitfalls.
Formula
ŷ = b₀ + b₁x where b₁ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² and b₀ = ȳ − b₁x̄
What Regression Analysis Does
Regression finds the "line of best fit" through your data — the line that minimises the total squared distance between the observed y values and the predicted ŷ values. This is the Ordinary Least Squares (OLS) criterion.
The slope (b₁) tells you: for each one-unit increase in x, the predicted y changes by b₁ units. If b₁ = 3.5 (hours of study vs exam score), then each additional hour of study is associated with 3.5 more points on the exam.
The intercept (b₀) tells you: the predicted value of y when x = 0. Sometimes the intercept is meaningful (e.g., baseline cost when quantity = 0); sometimes it is not interpretable (e.g., predicted salary when years of experience = 0 might be negative or outside the data range). Do not over-interpret the intercept if x = 0 is outside your data.
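The slope and intercept formulas above translate directly into code. The sketch below is a minimal pure-Python illustration (the function name `ols_fit` is ours, not a standard library call):

```python
def ols_fit(x, y):
    """Ordinary least squares for simple linear regression.

    Returns (b0, b1) for the fitted line y-hat = b0 + b1 * x.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Using the advertising data from the worked example that follows:
b0, b1 = ols_fit([1, 2, 3, 4, 5], [14, 17, 22, 23, 28])
# b1 ≈ 3.4, b0 ≈ 10.6
```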
Worked Example — Advertising Spend vs Sales
Data: 5 weeks of TV advertising spend (x, $000s) and product sales (y, units): (1, 14), (2, 17), (3, 22), (4, 23), (5, 28).
x̄ = 3, ȳ = 20.8.
Slope b₁ = [(1−3)(14−20.8) + (2−3)(17−20.8) + (3−3)(22−20.8) + (4−3)(23−20.8) + (5−3)(28−20.8)] / [(1−3)² + (2−3)² + (3−3)² + (4−3)² + (5−3)²]
= [(−2)(−6.8) + (−1)(−3.8) + (0)(1.2) + (1)(2.2) + (2)(7.2)] / [4 + 1 + 0 + 1 + 4]
= [13.6 + 3.8 + 0 + 2.2 + 14.4] / 10 = 34 / 10 = 3.4.
Intercept b₀ = ȳ − b₁x̄ = 20.8 − 3.4 × 3 = 20.8 − 10.2 = 10.6.
Regression equation: ŷ = 10.6 + 3.4x.
Interpretation: Each additional $1,000 in TV advertising is associated with 3.4 more units sold. Predicted sales with $3,000 spend: ŷ = 10.6 + 3.4 × 3 = 20.8 units.
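The hand calculation above can be cross-checked with NumPy: a degree-1 `np.polyfit` performs the same least-squares fit and returns the slope first (a sketch, assuming NumPy is installed):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)       # TV spend ($000s)
y = np.array([14, 17, 22, 23, 28], dtype=float)  # sales (units)

b1, b0 = np.polyfit(x, y, 1)  # degree-1 fit returns [slope, intercept]
print(round(b1, 4), round(b0, 4))  # 3.4 10.6

predicted = b0 + b1 * 3.0  # predicted sales at $3,000 spend
print(round(predicted, 1))  # 20.8
```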
Understanding R² (Coefficient of Determination)
R² measures the proportion of variance in y that is explained by x. It ranges from 0 to 1.
R² = 0: the regression line is no better than simply predicting the mean ȳ for all observations.
R² = 1: the regression line perfectly predicts every y value — all points lie exactly on the line.
R² = 0.75 means the model explains 75% of the variance in y. The remaining 25% is unexplained variability ("residual" or "error").
For simple linear regression, R² is the square of the Pearson correlation coefficient: R² = r². (In multiple regression, R² is instead the squared correlation between the observed and fitted values.)
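Both routes to R² — one minus the residual share of variance, and the squared Pearson r — give the same number for the advertising data. A pure-Python sketch (the variable names are ours):

```python
import math

x = [1, 2, 3, 4, 5]
y = [14, 17, 22, 23, 28]
y_hat = [10.6 + 3.4 * xi for xi in x]  # fitted values from the worked example
y_bar = sum(y) / len(y)

# Route 1: explained variance, R^2 = 1 - SS_res / SS_tot
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# Route 2: square of the Pearson correlation coefficient
x_bar = sum(x) / len(x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

print(round(r_squared, 4), round(r ** 2, 4))  # the two values agree: 0.9731 0.9731
```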
| R² Value | Interpretation | Example Domain |
|---|---|---|
| 0.90 – 1.00 | Excellent fit | Physics, engineering (controlled experiments) |
| 0.70 – 0.89 | Good fit | Business, applied science |
| 0.50 – 0.69 | Moderate fit | Social sciences, economics |
| 0.30 – 0.49 | Weak fit | Behavioural research |
| < 0.30 | Poor fit | Complex human behaviour, noisy data |
Regression Assumptions You Must Check
1. Linearity: the relationship between x and y is linear. Check with a scatterplot before running regression. If the relationship is curved, transform variables (e.g., log x) or use polynomial regression.
2. Independence: observations are independent of each other. Time series data typically violates this assumption — use time series regression methods instead.
3. Homoscedasticity: the spread of residuals is roughly constant across all values of x (equal variance). Check with a residuals vs. fitted values plot — the spread should be roughly uniform (no fan shape).
4. Normality of residuals: the residuals (y − ŷ) should be approximately normally distributed. Check with a Q-Q plot or histogram of residuals. This assumption matters mainly for small samples — large samples are robust via CLT.
5. No multicollinearity (multiple regression): predictor variables should not be highly correlated with each other. Check using Variance Inflation Factor (VIF > 5 is problematic).
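The multicollinearity check (point 5) can be sketched with plain NumPy by regressing each predictor on the others and converting the resulting R² into a VIF. The data below is synthetic and the helper name `vif` is ours:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of predictor matrix X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on the remaining predictors (with an intercept).
    """
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out

# Hypothetical predictors: x2 is nearly a copy of x1, so both get large VIFs
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # highly collinear with x1
x3 = rng.normal(size=100)                  # independent of the others
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # first two VIFs well above 5; third close to 1
```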
Regression vs Correlation — Key Differences
Correlation (r) measures the strength and direction of a linear relationship — it is symmetric. The correlation between x and y equals the correlation between y and x.
Regression predicts one variable from another — it is asymmetric. The regression of y on x is different from the regression of x on y. Regression requires you to designate a predictor (x) and an outcome (y) based on your research question.
When to use regression: you want to predict y from x, or quantify how much y changes per unit of x (the slope). When to use correlation: you want to know how strongly two variables are associated without implying a directional prediction.
| Aspect | Correlation (r) | Regression (b₁) |
|---|---|---|
| What it measures | Association strength and direction | Rate of change (slope) |
| Symmetric? | Yes — r(x,y) = r(y,x) | No — different equations for x→y and y→x |
| Units | Dimensionless (−1 to +1) | Units of y per unit of x |
| Purpose | Describe relationship | Predict y from x |
| Influenced by SD? | No (standardised) | Yes (raw units) |
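The asymmetry in the table is easy to demonstrate on the advertising data: regressing y on x and x on y gives two different slopes, and their product equals r² (a sketch using NumPy):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([14, 17, 22, 23, 28], dtype=float)

b_yx = np.polyfit(x, y, 1)[0]  # slope of y regressed on x
b_xy = np.polyfit(y, x, 1)[0]  # slope of x regressed on y — a different line
r = np.corrcoef(x, y)[0, 1]

print(round(b_yx, 4), round(b_xy, 4))         # different slopes: 3.4 vs ~0.286
print(round(b_yx * b_xy, 4), round(r**2, 4))  # their product equals r^2
```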
Educational use only. Content is based on publicly documented mathematical formulas and reviewed for accuracy by the CalcMulti Editorial Team. Last updated: February 2026.