Linear Regression


Types of Linear Regression

  1. Simple Linear Regression:
    • This involves only one independent variable (predictor).
    • The relationship is modeled as a straight line:
      Y = β0 + β1X + ε
      Where:
      • Y is the dependent (target) variable.
      • X is the independent (predictor) variable.
      • β0 is the intercept (the value of Y when X = 0).
      • β1 is the slope (the change in Y for a one-unit change in X).
      • ε is the error term, accounting for the randomness in the data.
  2. Multiple Linear Regression:
    • This involves multiple independent variables (predictors). The equation is generalized as:
      Y = β0 + β1X1 + β2X2 + ⋯ + βpXp + ε
      Where:
      • X1, X2, …, Xp are the multiple independent variables.
      • β1, β2, …, βp are the coefficients associated with each predictor variable.
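The multiple-regression equation above can be fit numerically as an ordinary least-squares problem. A minimal sketch using NumPy, on hypothetical noise-free data generated from Y = 5 + 2X1 + 3X2 so the recovered coefficients are easy to check:

```python
import numpy as np

# Hypothetical data for illustration: y = 5 + 2*x1 + 3*x2 (no noise)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = np.array([13.0, 12.0, 23.0, 22.0, 33.0])

# Prepend a column of ones so the first fitted coefficient is the intercept β0
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution to A @ beta ≈ y
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # approximately [5. 2. 3.] -> β0, β1, β2
```

Because the data here lie exactly on the plane, the solver recovers β0 = 5, β1 = 2, β2 = 3 up to floating-point error; with noisy data it returns the coefficients minimizing the sum of squared residuals.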

Key Concepts

  1. Fitting the Model:
    • The goal of linear regression is to find the best-fit line (in simple linear regression) or the best-fit hyperplane (in multiple linear regression) that minimizes the error between the predicted and actual values of Y.
    • This is typically done using Ordinary Least Squares (OLS), which minimizes the sum of the squared residuals (errors).
  2. Residuals (Errors):
    • The residual is the difference between the observed value Y and the predicted value Ŷ:
      Residual = Y − Ŷ
  3. Assumptions of Linear Regression:
    • Linearity: The relationship between the predictors and the target is linear.
    • Independence: The residuals are independent.
    • Homoscedasticity: The residuals have constant variance.
    • Normality: The residuals are normally distributed (especially important for hypothesis testing).
  4. Goodness of Fit:
    • The R-squared (R²) value is used to measure the goodness of fit. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
    • R² ranges from 0 to 1:
      • R² = 1: Perfect fit.
      • R² = 0: The model does not explain any of the variance.
  5. Interpretation of Coefficients:
    • In simple linear regression, the slope coefficient β1 tells us how much the dependent variable Y changes for a one-unit change in the independent variable X.
    • In multiple linear regression, each coefficient represents the change in Y for a one-unit change in that predictor variable, holding the other predictors constant.
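The concepts above (OLS fitting, residuals, and R²) can be computed directly with NumPy. A small sketch on hypothetical, slightly noisy data; the formulas are the closed-form OLS estimates for simple linear regression:

```python
import numpy as np

# Hypothetical data with slight noise, for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates: β1 = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)², β0 = ȳ - β1·x̄
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Residuals: observed minus predicted
y_hat = intercept + slope * x
residuals = y - y_hat

# R² = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print(slope, intercept, r2)  # R² is close to 1 for this nearly linear data
```

Note that by construction the residuals of an OLS fit with an intercept sum to zero, which is why R² on the training data always falls between 0 and 1 here.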

Example of Linear Regression (Simple):

Let’s consider a simple example where we want to predict a person’s weight based on their height.

  1. Data Example:
    • Height (X): [150, 160, 170, 180, 190]
    • Weight (Y): [50, 60, 70, 80, 90]
  2. Simple Linear Regression Equation: Y = β0 + β1X
    You would use linear regression to find the coefficients β0 and β1 that best fit the data.
  3. Steps to Fit the Model:
    • Calculate the line that best fits the data (i.e., find β0 and β1).
    • Use the equation to predict the weight for a new height.
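The steps above can be carried out by hand with the closed-form formulas, using the height/weight data from the example:

```python
import numpy as np

# Data from the example above
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

# β1 = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)², β0 = ȳ - β1·x̄
b1 = np.sum((height - height.mean()) * (weight - weight.mean())) / np.sum((height - height.mean()) ** 2)
b0 = weight.mean() - b1 * height.mean()
print(b0, b1)  # -100.0 1.0 — the example data is perfectly linear

# Use the fitted equation to predict the weight for a new height (175 cm)
print(b0 + b1 * 175)  # 75.0
```

Since each 10 cm of height corresponds to exactly 10 kg of weight in this data, the slope comes out to exactly 1 and the intercept to −100.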

Example in Python using sklearn:

Here’s how you can implement simple linear regression using Python’s scikit-learn library.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Data
X = np.array([150, 160, 170, 180, 190]).reshape(-1, 1)  # Height
Y = np.array([50, 60, 70, 80, 90])  # Weight

# Create and fit the model
model = LinearRegression()
model.fit(X, Y)

# Get the coefficients
intercept = model.intercept_  # β0
slope = model.coef_[0]  # β1 (coef_ is an array with one entry per feature)

# Make predictions
predictions = model.predict(X)

# Plot the results
plt.scatter(X, Y, color='blue', label='Actual data')
plt.plot(X, predictions, color='red', label='Fitted line')
plt.xlabel('Height')
plt.ylabel('Weight')
plt.legend()
plt.show()

print(f'Intercept (β0): {intercept}')
print(f'Slope (β1): {slope}')

Output:

  • Intercept (β0): The estimated weight when height is zero (though this may not be practically meaningful).
  • Slope (β1): The change in weight for each unit change in height.

Evaluation:

  • R²: You can evaluate the model’s performance using R², which tells you how well the model explains the variance in the target variable. If R² is close to 1, the model is performing well.
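With scikit-learn, R² is available directly via the model’s score method. A short sketch reusing the example data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same example data: height (X) and weight (Y)
X = np.array([150, 160, 170, 180, 190]).reshape(-1, 1)
Y = np.array([50, 60, 70, 80, 90])

model = LinearRegression().fit(X, Y)

# score() returns R² for the given data
r2 = model.score(X, Y)
print(r2)  # 1.0 — the example data lies exactly on a line
```

For a more honest estimate of performance, R² should be computed on held-out data rather than the training data, since training R² never penalizes overfitting.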

Conclusion:

Linear regression is a powerful and easy-to-understand method for modeling relationships in data. It works well when the assumptions hold true (e.g., linearity, no multicollinearity), but it may not perform well if these assumptions are violated.