Implementing Linear Regression from Scratch in Python
This tutorial walks through implementing linear regression from scratch in Python, without using machine learning libraries like scikit-learn. We’ll cover the math behind linear regression, implement core functionality, and demonstrate usage with real data.
Overview
Our implementation will include:
- Basic matrix operations for linear algebra
- Linear regression weight calculation
- Prediction functionality
- Data loading and visualization
- Example applications
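The weight calculation listed above rests on the normal equation. As a brief sketch in standard notation (the symbols X, y, w, and J are introduced here for illustration and do not appear in the code): let X be the design matrix whose first column is all ones, y the column of targets, and w the column of weights. Minimising the squared error and setting its gradient to zero gives a closed form for w:

```latex
\begin{aligned}
J(\mathbf{w}) &= \lVert X\mathbf{w} - \mathbf{y} \rVert^{2} \\
\nabla_{\mathbf{w}} J &= 2\, X^{\top} (X\mathbf{w} - \mathbf{y}) = 0 \\
X^{\top} X\, \mathbf{w} &= X^{\top} \mathbf{y} \\
\mathbf{w} &= (X^{\top} X)^{-1} X^{\top} \mathbf{y}
\end{aligned}
```

The last step is only valid when X^T X is invertible, which is why the fit method needs a matrix inverse and has to handle the singular case.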
Implementation
Let’s start by implementing the core LinearRegression class:
class LinearRegression:
    def __init__(self):
        """
        Initializes the LinearRegression object with weights set to None.
        """
        self.weights = None

    def matmul(self, A, B):
        """
        Matrix multiplication of A and B.
        """
        if not (isinstance(A, list) and isinstance(B, list)) or len(A[0]) != len(B):
            raise ValueError("Matrix dimensions are not compatible for multiplication.")
        result = [[0 for _ in range(len(B[0]))] for _ in range(len(A))]
        for i in range(len(A)):
            for j in range(len(B[0])):
                for k in range(len(B)):
                    result[i][j] += A[i][k] * B[k][j]
        return result

    def transpose(self, A):
        """
        Transposes matrix A.
        """
        return [[A[j][i] for j in range(len(A))] for i in range(len(A[0]))]

    def inverse_2x2(self, A):
        """
        Inverts a 2x2 matrix.
        """
        if len(A) != 2 or len(A[0]) != 2:
            raise ValueError("Matrix must be 2x2.")
        det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
        if det == 0:
            return None
        return [[A[1][1] / det, -A[0][1] / det],
                [-A[1][0] / det, A[0][0] / det]]

    def fit(self, X, Y):
        """
        Calculates regression weights using the normal equation method.
        """
        X_transpose = self.transpose(X)
        XTX = self.matmul(X_transpose, X)
        if len(XTX) == 2 and len(XTX[0]) == 2:
            XTX_inv = self.inverse_2x2(XTX)
            if XTX_inv is None:
                print("Unable to calculate weights - matrix inversion failed.")
                return None
        else:
            print("Only 2x2 matrices are supported for the fit function.")
            return None
        XTY = self.matmul(X_transpose, Y)
        self.weights = self.matmul(XTX_inv, XTY)
        return self.weights

    def predict(self, X):
        """
        Makes predictions using calculated weights.
        """
        if self.weights is None:
            print("Cannot make predictions - weights not calculated.")
            return None
        if len(X[0]) != len(self.weights):
            raise ValueError("The dimensions of X and weights are incompatible.")
        return self.matmul(X, self.weights)
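Because fit relies on inverse_2x2, it only works with one feature plus a bias term. One way to lift that restriction would be a general inverse. The following is a minimal sketch using Gauss-Jordan elimination with partial pivoting, which is one standard technique and not part of the class above; the function name inverse and the tol parameter are hypothetical.

```python
def inverse(A, tol=1e-12):
    """Invert a square matrix (list of lists) by Gauss-Jordan elimination.

    Returns None when the matrix is singular (pivot smaller than tol),
    mirroring the behaviour of inverse_2x2. Hypothetical helper.
    """
    n = len(A)
    if any(len(row) != n for row in A):
        raise ValueError("Matrix must be square.")

    # Build the augmented matrix [A | I] with float entries.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(A)]

    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot_row][col]) < tol:
            return None
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]

        # Scale the pivot row so the pivot becomes 1.
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]

        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [rv - factor * cv for rv, cv in zip(aug[r], aug[col])]

    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]


print(inverse([[2, 3], [3, 5]]))  # approximately [[5, -3], [-3, 2]]
```

Swapping something like this in for inverse_2x2 would also mean removing the 2x2 check in fit. Results are floating-point approximations, so compare them with a tolerance and not with ==.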
Using the Implementation
Let’s demonstrate usage with some example data:
# Initialize the model
lr = LinearRegression()

# Sample data: each row of X is [bias, x]
X = [[1, 1], [1, 2]]  # Features with bias term
Y = [[5], [6]]        # Target values

# Fit the model
weights = lr.fit(X, Y)
print("Calculated weights:", weights)  # [[4.0], [1.0]] -> intercept 4, slope 1

# Make predictions
predictions = lr.predict(X)
print("Predictions:", predictions)  # [[5.0], [6.0]]
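For a single feature, the normal equation reduces to the textbook slope and intercept formulas, which gives an independent way to check the result. The sketch below is a cross-check under that assumption and not part of the class; simple_fit is a hypothetical helper name. For the sample data it should agree with the weights from fit: intercept 4 and slope 1.

```python
def simple_fit(xs, ys):
    """Closed-form least squares for one feature: returns (intercept, slope).

    Hypothetical helper used only to cross-check LinearRegression.fit.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("All x values are identical; the slope is undefined.")

    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Same data as the example above, without the bias column.
intercept, slope = simple_fit([1, 2], [5, 6])
print(intercept, slope)  # 4.0 1.0
```

The sxx == 0 case corresponds to the singular X^T X that makes inverse_2x2 return None.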
Visualizing Results
The class also has a plot method that draws the data points and the fitted line. Unlike the rest of the implementation, it depends on NumPy and Matplotlib:
# Imports required at the top of the file
import numpy as np
import matplotlib.pyplot as plt

# Inside the LinearRegression class:
    def plot(self, X, Y, predicted_Y, future_X=None, future_predicted_Y=None, plot_options=None):
        """
        Plots actual data points and regression line.
        """
        if plot_options is None:
            plot_options = {}
        # Column 0 of X is the bias term, so the feature is in column 1
        x_vals = [row[1] for row in X]
        y_vals = [val[0] for val in Y]

        # Get regression line parameters
        # (the line is drawn from the weights; predicted_Y is not used here)
        w0 = self.weights[0][0]  # Intercept
        w1 = self.weights[1][0]  # Slope

        # Generate fitted line points
        ind = np.linspace(min(x_vals), max(x_vals), 100)
        fitted_line = ind * w1 + w0

        # Plot actual data and regression line
        plt.plot(x_vals, y_vals, 'bo', label='Actual data')
        plt.plot(ind, fitted_line, 'r-', label='Fitted line')

        # Optionally plot predictions for inputs beyond the training data
        if future_X and future_predicted_Y:
            future_x_vals = [row[1] for row in future_X]
            future_predicted_vals = [val[0] for val in future_predicted_Y]
            plt.plot(future_x_vals, future_predicted_vals, 'g--',
                     label='Predicted future')

        plt.xlabel(plot_options.get('x_label', 'X'))
        plt.ylabel(plot_options.get('y_label', 'Y'))
        plt.title(plot_options.get('title', 'Linear Regression Fit'))
        plt.legend()
        plt.show()
Example Application: Temperature Trends
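Let’s use our implementation to analyze temperature data. The load_data method called here is not part of the class listing above; it is assumed to read a CSV file and return X (with a bias column) and Y in the list-of-lists format that fit expects: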
# Load temperature data
X_temps, Y_temps = lr.load_data('temperature_data.csv')
# Fit model and make predictions
lr.fit(X_temps, Y_temps)
predicted_temps = lr.predict(X_temps)
# Plot results
plot_options = {
'x_label': 'Year',
'y_label': 'Temperature (°C)',
'title': 'Temperature Trends Over Time'
}
lr.plot(X_temps, Y_temps, predicted_temps, plot_options=plot_options)
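The example calls load_data, which the class listing does not define. A minimal sketch of what such a loader might look like follows. It makes several assumptions: the CSV file has a header row, the year is in the first column, and the temperature is in the second. The parameter names x_column, y_column, and has_header are hypothetical. It is written as a standalone function; adding self as the first parameter would turn it into the method the example calls.

```python
import csv


def load_data(path, x_column=0, y_column=1, has_header=True):
    """Read two numeric columns from a CSV file.

    Returns (X, Y) in the format fit expects:
    X rows are [1.0, x] (bias term first) and Y rows are [y].
    The column layout is an assumption, not taken from the original data.
    """
    X, Y = [], []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)  # Skip the header row
        for row in reader:
            if not row:
                continue  # Skip blank lines
            X.append([1.0, float(row[x_column])])
            Y.append([float(row[y_column])])
    return X, Y
```

For a file whose rows are 2000,14.5 and 2001,14.7 under a header line, this returns X = [[1.0, 2000.0], [1.0, 2001.0]] and Y = [[14.5], [14.7]].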
Key Features
- Matrix Operations: Custom implementations of matrix multiplication, transposition, and 2×2 matrix inversion
- Modular Design: Separate methods for fitting, prediction, and visualization
- Error Handling: Incompatible dimensions raise ValueError; a singular matrix or missing weights prints a message and returns None
- Visualization: Flexible plotting options with support for future predictions
Limitations
- Only handles 2×2 matrices for inverse calculations, so fit works only when X has exactly two columns (a bias term plus one feature)
- Requires input data in specific format (lists of lists)
- No regularization or advanced features
- Limited error metrics and model evaluation tools
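The last limitation is simple to address. As a rough sketch, two common metrics can be computed directly on the column-vector format the class uses. The function names mean_squared_error and r_squared are hypothetical, and the numbers in the usage lines are made up for illustration, not results from the model.

```python
def mean_squared_error(Y, predicted_Y):
    """Average squared residual; both arguments are column vectors like [[5], [6]]."""
    n = len(Y)
    return sum((y[0] - p[0]) ** 2 for y, p in zip(Y, predicted_Y)) / n


def r_squared(Y, predicted_Y):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y[0] for y in Y) / len(Y)
    ss_res = sum((y[0] - p[0]) ** 2 for y, p in zip(Y, predicted_Y))
    ss_tot = sum((y[0] - y_mean) ** 2 for y in Y)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all target values are equal.")
    return 1 - ss_res / ss_tot


# Illustrative values only
Y = [[1], [2], [3]]
predicted_Y = [[1], [2], [4]]
print(r_squared(Y, predicted_Y))  # 0.5
```

An R² of 1 means the predictions match the targets exactly; values near 0 mean the model does little better than predicting the mean.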
Conclusion
This implementation provides a foundation for understanding linear regression from first principles. It is far less optimized than professional libraries like scikit-learn, but it makes the core concepts and mathematics explicit.
For production use cases, prefer an established library that offers more features, better optimization, and support for larger datasets.