Data Science Fundamentals
Lesson 6: Supervised Learning – Regression and Classification
Objectives:
- Understand and differentiate between regression and classification tasks.
- Learn about key algorithms for regression and classification.
- Implement and evaluate regression and classification models.
1. Supervised Learning Overview
Supervised learning involves training a model on a labeled dataset, where each example pairs input features with a known outcome. The goal is to learn a mapping that predicts the outcome for new, unseen inputs.
Types:
- Regression: Predicts continuous values (e.g., predicting house prices).
- Classification: Predicts discrete labels or categories (e.g., email spam detection).
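To make the distinction concrete, the type of the target array usually signals the task. A minimal sketch with made-up toy data:
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
X = np.array([[1], [2], [3], [4]])  # input features
y_continuous = np.array([1.5, 3.1, 4.4, 6.2])  # regression target: real-valued
y_categorical = np.array([0, 0, 1, 1])  # classification target: discrete labels
LinearRegression().fit(X, y_continuous)  # regression model predicts a number
LogisticRegression().fit(X, y_categorical)  # classification model predicts a class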
2. Regression Analysis
Regression models predict a continuous output based on input features. Common regression algorithms include:
a. Linear Regression
Linear regression models the relationship between a dependent variable and one or more independent variables using a linear equation (for a single feature, y = b0 + b1*x).
Code Example: Simple Linear Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Sample Data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])  # Independent variable
y = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])  # Dependent variable (ten points so the test split holds more than one sample and R^2 is defined)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate model
print(f"Intercept: {model.intercept_}")
print(f"Coefficient: {model.coef_}")
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
print(f"R^2 Score: {r2_score(y_test, y_pred)}")
# Plotting
plt.scatter(X, y, color='black', label='Actual data')
plt.plot(X, model.predict(X), color='blue', linewidth=3, label='Fitted line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()
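Once trained, the model can score inputs it has never seen; for example, reusing the model above (x = 11 is an illustrative value):
# Predict the target for a new, unseen input
new_X = np.array([[11]])
print(f"Prediction for x=11: {model.predict(new_X)}")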
b. Polynomial Regression
Polynomial regression extends linear regression by fitting a polynomial curve to the data.
Code Example: Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Split raw features; the pipeline below applies the polynomial transform itself
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model (feature expansion + linear fit in one pipeline)
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate model
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
print(f"R^2 Score: {r2_score(y_test, y_pred)}")
# Plotting
plt.scatter(X, y, color='black', label='Actual data')
X_fit = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
plt.plot(X_fit, model.predict(X_fit), color='blue', linewidth=3, label='Fitted curve')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression')
plt.legend()
plt.show()
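The polynomial degree is a hyperparameter. A quick sketch, reusing the split above, compares held-out error across a few candidate degrees (the range of degrees is an arbitrary choice for illustration):
# Compare test-set error for different polynomial degrees
for degree in (1, 2, 3):
    candidate = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    candidate.fit(X_train, y_train)
    print(f"degree={degree}, test MSE: {mean_squared_error(y_test, candidate.predict(X_test)):.3f}")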
3. Classification Analysis
Classification models predict categorical labels. Common classification algorithms include:
a. Logistic Regression
Logistic regression is used for binary classification, predicting probabilities that are mapped to classes; scikit-learn extends it to multiclass problems, as in the iris example below.
Code Example: Logistic Regression
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create and train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
b. k-Nearest Neighbors (k-NN)
k-NN classifies a sample based on the majority label among its k nearest neighbors.
Code Example: k-NN Classification
from sklearn.neighbors import KNeighborsClassifier
# Create and train model
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
4. Model Evaluation Metrics
For Regression:
- Mean Squared Error (MSE): The average squared difference between predicted and actual values; lower is better.
- R-squared (R²): The proportion of variance in the target explained by the model; closer to 1 is better.
For Classification:
- Accuracy: The proportion of correctly classified instances.
- Confusion Matrix: A table of true positives, true negatives, false positives, and false negatives.
- Precision, Recall, F1-Score: Metrics that balance false positives against false negatives (computed by hand in the sketch at the end of this section).
Code Example: Model Evaluation Metrics
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix, classification_report
# For regression (y_test and y_pred produced by a regression model)
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
print(f"R^2 Score: {r2_score(y_test, y_pred)}")
# For classification (y_test and y_pred produced by a classifier)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
5. Key Takeaways
- Regression models predict continuous values and are evaluated using metrics like MSE and R².
- Classification models predict categorical outcomes and are evaluated using accuracy, confusion matrix, and classification report.
- Implementing and evaluating models requires understanding both the algorithms and appropriate metrics.
6. Homework/Practice:
- Choose a dataset and implement both regression and classification models.
- Evaluate and compare the performance of different algorithms using appropriate metrics.
- Experiment with hyperparameter tuning and model optimization techniques to improve performance.