Types of Linear Regression
Simple Linear Regression:
- This involves only one independent variable (predictor).
The relationship is modeled as a straight line:
Y = \beta_0 + \beta_1 X + \epsilon
Where:
- Y is the dependent (target) variable.
- X is the independent (predictor) variable.
- \beta_0 is the intercept (the value of Y when X = 0).
- \beta_1 is the slope (the change in Y for a one-unit change in X).
- \epsilon is the error term, accounting for the randomness in the data.
Multiple Linear Regression:
This involves multiple independent variables (predictors). The equation is generalized as:
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon
Where:
- X_1, X_2, \dots, X_p are the multiple independent variables.
- \beta_1, \beta_2, \dots, \beta_p are the coefficients associated with each predictor variable.
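To make this concrete, here is a minimal scikit-learn sketch of multiple linear regression. The data and the underlying relationship (Y ≈ 1 + 2·X1 + 3·X2) are invented for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: each row holds two predictors (X1, X2)
X = np.array([[1, 2],
              [2, 1],
              [3, 4],
              [4, 3],
              [5, 5]])
# Target roughly follows Y = 1 + 2*X1 + 3*X2, with small noise added
Y = np.array([9.1, 7.9, 19.2, 18.1, 26.0])

model = LinearRegression()
model.fit(X, Y)

print(f'Intercept (β0): {model.intercept_:.2f}')
print(f'Coefficients (β1, β2): {model.coef_}')
# Each coefficient is the change in Y per one-unit change in that
# predictor, holding the other predictor constant.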
Key Concepts
Fitting the Model:
- The goal of linear regression is to find the best-fit line (in simple linear regression) or the best-fit hyperplane (in multiple linear regression) that minimizes the error between the predicted and actual values of Y.
- This is typically done using Ordinary Least Squares (OLS), which minimizes the sum of the squared residuals (errors).
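As a sketch of what OLS computes, the coefficients solve the normal equations \hat{\beta} = (X^T X)^{-1} X^T Y. Here is a minimal NumPy illustration on the toy height/weight data used later in this article (scikit-learn does not solve it this way internally, but the answer is the same here):

import numpy as np

# The same toy height/weight data used in the example below
X = np.array([150, 160, 170, 180, 190], dtype=float)
Y = np.array([50, 60, 70, 80, 90], dtype=float)

# Design matrix: a column of ones (for the intercept) next to X
A = np.column_stack([np.ones_like(X), X])

# Solve min ||A·β − Y||² (numerically stabler than inverting AᵀA directly)
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(f'Intercept (β0): {beta[0]}')  # -100.0
print(f'Slope (β1): {beta[1]}')      # 1.0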
Residuals (Errors):
- The residual is the difference between the observed value Y and the predicted value \hat{Y}:
\text{Residual} = Y - \hat{Y}
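For instance, with the toy height/weight data and the fitted line \hat{Y} = X - 100 from the OLS sketch above, the residuals are computed element-wise:

import numpy as np

X = np.array([150, 160, 170, 180, 190], dtype=float)
Y = np.array([50, 60, 70, 80, 90], dtype=float)

# Fitted values from the line Ŷ = −100 + 1·X (estimated above)
Y_hat = -100 + 1 * X
residuals = Y - Y_hat
print(residuals)  # all zeros: this toy data lies exactly on the line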
Assumptions of Linear Regression:
- Linearity: The relationship between the predictors and the target is linear.
- Independence: The residuals are independent.
- Homoscedasticity: The residuals have constant variance.
- Normality: The residuals are normally distributed (especially important for hypothesis testing).
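These assumptions are usually checked on the residuals after fitting. Here is a minimal diagnostic sketch on synthetic data (the sample size and noise level are invented for illustration): a residuals-vs-fitted plot for linearity and homoscedasticity, plus a Shapiro–Wilk test for normality.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic data: heights with noisy, linearly related weights
rng = np.random.default_rng(0)
X = rng.uniform(150, 190, size=50).reshape(-1, 1)
Y = X.ravel() - 100 + rng.normal(0, 2, size=50)

model = LinearRegression().fit(X, Y)
fitted = model.predict(X)
residuals = Y - fitted

# Residuals vs. fitted: look for no pattern (linearity) and a
# roughly constant vertical spread (homoscedasticity)
plt.scatter(fitted, residuals)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# Shapiro–Wilk test: a small p-value suggests non-normal residuals
stat, p = stats.shapiro(residuals)
print(f'Shapiro–Wilk p-value: {p:.3f}')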
Goodness of Fit:
- The R-squared (R²) value is used to measure the goodness of fit. It represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
R² ranges from 0 to 1:
- R² = 1: Perfect fit.
- R² = 0: The model does not explain any of the variance.
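Concretely, R² compares the model's squared residuals to those of a baseline that always predicts the mean \bar{Y}:
R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}
(For an OLS fit with an intercept, evaluated on its own training data, this lies between 0 and 1; on held-out data it can even go negative.)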
Interpretation of Coefficients:
- In simple linear regression, the slope coefficient \beta_1 tells us how much the dependent variable Y changes for a unit change in the independent variable X.
- In multiple linear regression, each coefficient represents the change in Y for a one-unit change in that predictor variable, holding the other predictors constant.
Example of Linear Regression (Simple):
Let’s consider a simple example where we want to predict a person’s weight based on their height.
Data Example:
- Height (X): [150, 160, 170, 180, 190]
- Weight (Y): [50, 60, 70, 80, 90]
Simple Linear Regression Equation:
Y = \beta_0 + \beta_1 X
You would use linear regression to find the coefficients \beta_0 and \beta_1 that best fit the data.
Steps to Fit the Model:
- Calculate the line that best fits the data (i.e., find \beta_0 and \beta_1; worked out by hand just below).
- Use the equation to predict the weight for a new height.
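Before reaching for a library, the coefficients for this tiny dataset can be worked out by hand with the usual OLS formulas:
\bar{X} = 170, \quad \bar{Y} = 70
\beta_1 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} = \frac{1000}{1000} = 1
\beta_0 = \bar{Y} - \beta_1 \bar{X} = 70 - 1 \cdot 170 = -100
So the fitted line is \hat{Y} = X - 100, and a new height of 175 predicts a weight of 75. (The fit is exact because this toy data lies perfectly on a line.)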
Example in Python using sklearn:
Here’s how you can implement simple linear regression using Python’s scikit-learn library.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Data
X = np.array([150, 160, 170, 180, 190]).reshape(-1, 1)  # Height
Y = np.array([50, 60, 70, 80, 90])  # Weight

# Create and fit the model
model = LinearRegression()
model.fit(X, Y)

# Get the coefficients
intercept = model.intercept_  # β0
slope = model.coef_           # β1

# Make predictions
predictions = model.predict(X)

# Plot the results
plt.scatter(X, Y, color='blue', label='Actual data')
plt.plot(X, predictions, color='red', label='Fitted line')
plt.xlabel('Height')
plt.ylabel('Weight')
plt.legend()
plt.show()

print(f'Intercept (β0): {intercept}')
print(f'Slope (β1): {slope}')
Output:
- Intercept (β0): the estimated weight when height is zero (here −100, which is not practically meaningful on its own).
- Slope (β1): the change in weight for each unit change in height (here 1.0).
Evaluation:
- R²: You can evaluate the model’s performance using R², which tells you how well the model explains the variance in the target variable. If R² is close to 1, the model is performing well.
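A minimal way to compute R² with scikit-learn, shown here on the same toy data, is the estimator's score method or, equivalently, sklearn.metrics.r2_score:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = np.array([150, 160, 170, 180, 190]).reshape(-1, 1)
Y = np.array([50, 60, 70, 80, 90])

model = LinearRegression().fit(X, Y)

# Two equivalent ways to compute R² on the training data
print(model.score(X, Y))              # estimator's built-in R²
print(r2_score(Y, model.predict(X)))  # same value via sklearn.metrics

Both print 1.0 here, since the toy data is perfectly linear.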
Conclusion:
Linear regression is a powerful and easy-to-understand method for modeling relationships in data. It works well when the assumptions hold true (e.g., linearity, no multicollinearity), but it may not perform well if these assumptions are violated.