{ "cells": [ { "cell_type": "markdown", "id": "header", "metadata": {}, "source": [ "# Multidimensional Data Exploration\n", "\n", "So far, we've explored one-dimensional data and seen how well we can describe it with histograms, eCDFs, and summary statistics. Today, we'll discover what happens when we move to multiple dimensions and why we need a fundamentally different approach: **models**.\n", "\n", "## Topics Covered\n", "- The limitations of describing multi-dimensional data\n", "- Introduction to linear regression models\n", "- Model fitting and interpretation\n", "- Residuals and model diagnostics\n", "- Extending to multiple predictors" ] }, { "cell_type": "markdown", "id": "setup-header", "metadata": {}, "source": [ "## Setup and Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "imports", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import r2_score\n", "import pandas as pd\n", "\n", "# Set random seed for reproducibility\n", "np.random.seed(42)\n", "\n", "# Set plotting style\n", "plt.style.use('default')\n", "sns.set_palette('husl')" ] }, { "cell_type": "markdown", "id": "helpers-header", "metadata": {}, "source": [ "## Helper Functions" ] }, { "cell_type": "code", "execution_count": null, "id": "helpers", "metadata": {}, "outputs": [], "source": [ "def generate_anscombe():\n", " \"\"\"\n", " Generate Anscombe's quartet - four datasets with identical statistics\n", " but very different patterns.\n", " \n", " Returns:\n", " dict: Dictionary containing x and y arrays for each of 4 datasets\n", " \"\"\"\n", " # Dataset I: Linear relationship\n", " x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])\n", " y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])\n", " \n", " # Dataset II: Non-linear relationship\n", " x2 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])\n", " y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])\n", " \n", " # Dataset III: Linear with outlier\n", " x3 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])\n", " y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])\n", " \n", " # Dataset IV: Outlier creates apparent relationship\n", " x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])\n", " y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])\n", " \n", " return {\n", " 'x1': x1, 'y1': y1,\n", " 'x2': x2, 'y2': y2,\n", " 'x3': x3, 'y3': y3,\n", " 'x4': x4, 'y4': y4\n", " }\n", "\n", "def plot_3d_scatter(X, y, feature_names=None):\n", " \"\"\"\n", " Create a 3D scatter plot (2 features + 1 target)\n", " \n", " Parameters:\n", " X: 2D array of shape (n, 2) - two features\n", " y: 1D array of shape (n,) - target variable\n", " feature_names: list of strings for axis labels\n", " \"\"\"\n", " from mpl_toolkits.mplot3d import Axes3D\n", " \n", " fig = plt.figure(figsize=(10, 8))\n", " ax = fig.add_subplot(111, projection='3d')\n", " \n", " scatter = ax.scatter(X[:, 0], X[:, 1], y, c=y, cmap='viridis', \n", " s=50, alpha=0.6, edgecolor='black')\n", " \n", " if feature_names:\n", " ax.set_xlabel(feature_names[0])\n", " ax.set_ylabel(feature_names[1])\n", " ax.set_zlabel(feature_names[2])\n", " else:\n", " ax.set_xlabel('Feature 1')\n", " ax.set_ylabel('Feature 2')\n", " ax.set_zlabel('Target')\n", " \n", " plt.colorbar(scatter, label='Target Value')\n", " 
plt.tight_layout()\n", " plt.show()\n", "\n", "print(\"Helper functions loaded successfully!\")" ] }, { "cell_type": "markdown", "id": "part1-header", "metadata": {}, "source": [ "---\n", "## Part 1: The 1D Comfort Zone\n", "\n", "Let's remind ourselves how well we can describe one-dimensional data. We have a complete vocabulary: histograms, eCDFs, summary statistics." ] }, { "cell_type": "code", "execution_count": null, "id": "part1-demo", "metadata": {}, "outputs": [], "source": [ "# Generate 1D data - exam scores\n", "np.random.seed(42)\n", "exam_scores = np.random.normal(75, 12, 200)\n", "exam_scores = np.clip(exam_scores, 0, 100)\n", "\n", "# Create comprehensive visualization\n", "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n", "\n", "# Histogram\n", "axes[0].hist(exam_scores, bins=25, edgecolor='black', alpha=0.7, color='skyblue')\n", "axes[0].set_title('Histogram', fontsize=12, fontweight='bold')\n", "axes[0].set_xlabel('Score')\n", "axes[0].set_ylabel('Frequency')\n", "axes[0].grid(alpha=0.3)\n", "\n", "# eCDF\n", "sorted_scores = np.sort(exam_scores)\n", "ecdf = np.arange(1, len(sorted_scores)+1) / len(sorted_scores)\n", "axes[1].plot(sorted_scores, ecdf, linewidth=2, color='darkblue')\n", "axes[1].set_title('Empirical CDF', fontsize=12, fontweight='bold')\n", "axes[1].set_xlabel('Score')\n", "axes[1].set_ylabel('Cumulative Probability')\n", "axes[1].grid(alpha=0.3)\n", "\n", "# Summary statistics\n", "stats_text = f\"\"\"Mean: {np.mean(exam_scores):.1f}\n", "Median: {np.median(exam_scores):.1f}\n", "Std: {np.std(exam_scores, ddof=1):.1f}\n", "Min: {exam_scores.min():.1f}\n", "Max: {exam_scores.max():.1f}\n", "\n", "25th %: {np.percentile(exam_scores, 25):.1f}\n", "75th %: {np.percentile(exam_scores, 75):.1f}\"\"\"\n", "axes[2].text(0.15, 0.5, stats_text, fontsize=11, family='monospace',\n", " verticalalignment='center',\n", " bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))\n", "axes[2].axis('off')\n", "axes[2].set_title('Summary Statistics', fontsize=12, fontweight='bold')\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"\\n\" + \"=\"*60)\n", "print(\"For 1D data: We have COMPLETE vocabulary!\")\n", "print(\"=\"*60)\n", "print(\"Histogram + eCDF + summary stats = full description\")\n", "print(\"We can answer almost any question about this distribution.\")" ] }, { "cell_type": "markdown", "id": "part2-header", "metadata": {}, "source": [ "---\n", "## Part 2: Welcome to 2D - The Vocabulary Problem\n", "\n", "Now let's move to two dimensions. We'll quickly discover that our 1D tools don't scale well." 
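] }, { "cell_type": "markdown", "id": "part2-marginals-note", "metadata": {}, "source": [ "Before diving in, here is a small illustration of the problem (a toy sketch, separate from the activities below): the two datasets in the next cell contain *exactly* the same x values and *exactly* the same y values, so every 1D tool - histogram, eCDF, summary statistics - reports identical results for both. Only the pairing of x with y differs." ] }, { "cell_type": "code", "execution_count": null, "id": "part2-marginals-demo", "metadata": {}, "outputs": [], "source": [ "# Two datasets with identical 1D marginals but different joint structure.\n", "# Toy illustration - the data here is made up just for this demo.\n", "rng = np.random.default_rng(0)\n", "x_demo = rng.normal(size=300)\n", "y_linked = x_demo + rng.normal(scale=0.3, size=300)  # y depends on x\n", "y_shuffled = rng.permutation(y_linked)  # same y values, pairing destroyed\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharex=True, sharey=True)\n", "axes[0].scatter(x_demo, y_linked, alpha=0.5, edgecolor='black', s=30)\n", "axes[0].set_title('Linked: y depends on x')\n", "axes[1].scatter(x_demo, y_shuffled, alpha=0.5, edgecolor='black', s=30)\n", "axes[1].set_title('Shuffled: same values, no relationship')\n", "for ax in axes:\n", "    ax.set_xlabel('x')\n", "    ax.set_ylabel('y')\n", "    ax.grid(alpha=0.3)\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print('Histograms, eCDFs and summary stats of x and of y are identical in both panels;')\n", "print('only the pairing differs - and 1D tools cannot see pairing.')"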
] }, { "cell_type": "markdown", "id": "part2-generate", "metadata": {}, "source": [ "### Generate Multiple 2D Datasets" ] }, { "cell_type": "code", "execution_count": null, "id": "part2-data", "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "n = 100\n", "\n", "# Dataset A: Strong positive correlation\n", "x_A = np.random.randn(n)\n", "y_A = 2*x_A + np.random.randn(n)*0.5\n", "\n", "# Dataset B: Weak positive correlation\n", "x_B = np.random.randn(n)\n", "y_B = 0.5*x_B + np.random.randn(n)*2\n", "\n", "# Dataset C: No correlation\n", "x_C = np.random.randn(n)\n", "y_C = np.random.randn(n)\n", "\n", "# Dataset D: Non-linear (quadratic)\n", "x_D = np.linspace(-3, 3, n)\n", "y_D = x_D**2 + np.random.randn(n)*0.8\n", "\n", "print(f\"Generated {n} points for each of 4 datasets\")" ] }, { "cell_type": "code", "execution_count": null, "id": "part2-viz", "metadata": {}, "outputs": [], "source": [ "# Visualize all four datasets\n", "fig, axes = plt.subplots(2, 2, figsize=(12, 10))\n", "datasets = [\n", " (x_A, y_A, 'Dataset A'),\n", " (x_B, y_B, 'Dataset B'),\n", " (x_C, y_C, 'Dataset C'),\n", " (x_D, y_D, 'Dataset D')\n", "]\n", "\n", "for ax, (x, y, title) in zip(axes.ravel(), datasets):\n", " ax.scatter(x, y, alpha=0.6, edgecolor='black', s=40)\n", " ax.set_title(title, fontsize=12, fontweight='bold')\n", " ax.set_xlabel('X')\n", " ax.set_ylabel('Y')\n", " ax.grid(alpha=0.3)\n", " ax.axhline(0, color='gray', linewidth=0.5, linestyle='--', alpha=0.5)\n", " ax.axvline(0, color='gray', linewidth=0.5, linestyle='--', alpha=0.5)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "activity2-1-header", "metadata": {}, "source": [ "### Activity 2.1: The Description Challenge\n", "\n", "**Task**: Try to describe each dataset above in words. Be as precise as possible!" ] }, { "cell_type": "code", "execution_count": null, "id": "activity2-1", "metadata": {}, "outputs": [], "source": [ "# TODO: Describe each dataset in 1-2 sentences\n", "# Try to be precise enough that someone could distinguish them!\n", "\n", "description_A = \"\"\"\n", "YOUR DESCRIPTION HERE\n", "\"\"\"\n", "\n", "description_B = \"\"\"\n", "YOUR DESCRIPTION HERE\n", "\"\"\"\n", "\n", "description_C = \"\"\"\n", "YOUR DESCRIPTION HERE\n", "\"\"\"\n", "\n", "description_D = \"\"\"\n", "YOUR DESCRIPTION HERE\n", "\"\"\"\n", "\n", "# Reflection questions:\n", "# - Was it easy to describe these precisely?\n", "# - What vocabulary are you missing?\n", "# - What single number might help summarize the relationship?" ] }, { "cell_type": "markdown", "id": "activity2-2-header", "metadata": {}, "source": [ "### Activity 2.2: Correlation - One Number Summary\n", "\n", "Correlation gives us a single number (between -1 and +1) that measures linear association. But is one number enough?" 
] }, { "cell_type": "code", "execution_count": null, "id": "activity2-2", "metadata": {}, "outputs": [], "source": [ "# TODO: Calculate correlation coefficient for each dataset\n", "# Use: np.corrcoef(x, y)[0, 1] to extract the correlation value\n", "\n", "corr_A = None # YOUR CODE\n", "corr_B = None # YOUR CODE\n", "corr_C = None # YOUR CODE\n", "corr_D = None # YOUR CODE\n", "\n", "# TODO: Print results with interpretation\n", "print(\"Correlation coefficients:\")\n", "print(f\"Dataset A: {corr_A:.3f}\")\n", "print(f\"Dataset B: {corr_B:.3f}\")\n", "print(f\"Dataset C: {corr_C:.3f}\")\n", "print(f\"Dataset D: {corr_D:.3f}\")\n", "\n", "# Reflection questions:\n", "# - Which dataset has the strongest linear relationship?\n", "# - Does correlation fully capture what you see in Dataset D? Why or why not?\n", "# - What information does correlation miss?" ] }, { "cell_type": "markdown", "id": "anscombe-header", "metadata": {}, "source": [ "### Anscombe's Quartet: The Ultimate Warning\n", "\n", "Four datasets with **identical** summary statistics and correlations, but completely different patterns." ] }, { "cell_type": "code", "execution_count": null, "id": "anscombe", "metadata": {}, "outputs": [], "source": [ "# Load Anscombe's quartet\n", "anscombe_data = generate_anscombe()\n", "\n", "fig, axes = plt.subplots(2, 2, figsize=(12, 10))\n", "dataset_names = ['I', 'II', 'III', 'IV']\n", "\n", "for idx, (ax, dataset_name) in enumerate(zip(axes.ravel(), dataset_names)):\n", " x = anscombe_data[f'x{idx+1}']\n", " y = anscombe_data[f'y{idx+1}']\n", " \n", " # Scatter plot\n", " ax.scatter(x, y, s=80, edgecolor='black', alpha=0.7)\n", " \n", " # Calculate and display correlation\n", " corr = np.corrcoef(x, y)[0, 1]\n", " mean_x = np.mean(x)\n", " mean_y = np.mean(y)\n", " \n", " # Add statistics as text\n", " stats_text = f'r = {corr:.3f}\\nmean(x) = {mean_x:.1f}\\nmean(y) = {mean_y:.2f}'\n", " ax.text(0.05, 0.95, stats_text, transform=ax.transAxes, \n", " fontsize=10, verticalalignment='top',\n", " bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.7))\n", " \n", " ax.set_title(f'Dataset {dataset_name}', fontsize=12, fontweight='bold')\n", " ax.set_xlabel('X')\n", " ax.set_ylabel('Y')\n", " ax.grid(alpha=0.3)\n", "\n", "plt.suptitle(\"Anscombe's Quartet: Same Statistics, Different Stories!\", \n", " fontsize=14, fontweight='bold', y=1.00)\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"\\n\" + \"=\"*60)\n", "print(\"KEY INSIGHT\")\n", "print(\"=\"*60)\n", "print(\"All four datasets have nearly IDENTICAL:\")\n", "print(\" • Mean of X and Y\")\n", "print(\" • Variance of X and Y\")\n", "print(\" • Correlation coefficient\")\n", "print(\" • Linear regression line (we'll see this soon!)\")\n", "print(\"\\nBut they tell COMPLETELY different stories!\")\n", "print(\"\\nTakeaway: Summary statistics alone are insufficient.\")\n", "print(\" Always visualize! But even visualization has limits...\")" ] }, { "cell_type": "markdown", "id": "part3-header", "metadata": {}, "source": [ "---\n", "## Part 3: Enter the Linear Model\n", "\n", "What if instead of describing relationships with words or single numbers, we used **equations**? 
This is where linear regression comes in.\n", "\n", "### The Idea\n", "Model the relationship as: **Y = β₀ + β₁ × X + error**\n", "\n", "Where:\n", "- β₀ (intercept) = value of Y when X = 0\n", "- β₁ (slope) = change in Y for each unit increase in X\n", "- error = what the model doesn't explain (we estimate it with the residuals)" ] }, { "cell_type": "markdown", "id": "demo-header", "metadata": {}, "source": [ "### Demonstration: Your First Linear Model" ] }, { "cell_type": "code", "execution_count": null, "id": "demo-data", "metadata": {}, "outputs": [], "source": [ "# Generate demo data: study hours vs exam score\n", "np.random.seed(42)\n", "study_hours = np.random.uniform(1, 10, 80)\n", "exam_score = 40 + 5*study_hours + np.random.randn(80)*4\n", "exam_score = np.clip(exam_score, 0, 100)\n", "\n", "print(f\"Generated {len(study_hours)} students\")\n", "print(f\"Study hours range: {study_hours.min():.1f} to {study_hours.max():.1f}\")\n", "print(f\"Exam scores range: {exam_score.min():.1f} to {exam_score.max():.1f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "demo-fit", "metadata": {}, "outputs": [], "source": [ "# Fit a linear model\n", "from sklearn.linear_model import LinearRegression\n", "\n", "# Prepare data - sklearn needs X to be 2D\n", "X = study_hours.reshape(-1, 1)  # Convert to column vector\n", "y = exam_score\n", "\n", "# Create and fit the model\n", "model = LinearRegression()\n", "model.fit(X, y)\n", "\n", "# Extract the equation components\n", "intercept = model.intercept_\n", "slope = model.coef_[0]\n", "\n", "print(\"\\n\" + \"=\"*60)\n", "print(\"LINEAR MODEL\")\n", "print(\"=\"*60)\n", "print(f\"Equation: score = {intercept:.2f} + {slope:.2f} × hours\")\n", "print(f\"\\nInterpretation:\")\n", "print(f\"  • Intercept ({intercept:.2f}): Expected score at 0 hours of study\")\n", "print(f\"    (an extrapolation - every student here studied at least 1 hour)\")\n", "print(f\"  • Slope ({slope:.2f}): Each additional hour of study adds ~{slope:.2f} points\")" ] }, { "cell_type": "code", "execution_count": null, "id": "demo-viz", "metadata": {}, "outputs": [], "source": [ "# Visualize the model\n", "y_pred = model.predict(X)\n", "\n", "# Sort by x so the fitted line is drawn left to right\n", "order = np.argsort(study_hours)\n", "\n", "plt.figure(figsize=(10, 6))\n", "plt.scatter(study_hours, exam_score, alpha=0.6, edgecolor='black',\n", "            s=60, label='Actual data')\n", "plt.plot(study_hours[order], y_pred[order], 'r-', linewidth=2.5, label='Fitted line')\n", "plt.xlabel('Study Hours', fontsize=11)\n", "plt.ylabel('Exam Score', fontsize=11)\n", "plt.title('Linear Regression: Study Hours → Exam Score',\n", "          fontsize=13, fontweight='bold')\n", "plt.legend(fontsize=10)\n", "plt.grid(alpha=0.3)\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"\\nThe red line is our MODEL - it gives us vocabulary!\")\n", "print(\"Instead of describing with words, we have an EQUATION.\")" ] }, { "cell_type": "markdown", "id": "activity3-1-header", "metadata": {}, "source": [ "### Activity 3.1: Fit Your First Model\n", "\n", "**Dataset**: Temperature (°C) vs Ice Cream Sales ($1000s)\n", "\n", "Your task: Fit a linear model and interpret the results."
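] }, { "cell_type": "markdown", "id": "closed-form-note", "metadata": {}, "source": [ "One optional cross-check before you start (a sketch, not required for the activity): with a single predictor, least squares has a closed form,\n", "\n", "**slope = cov(X, Y) / var(X)**, **intercept = mean(Y) - slope × mean(X)**\n", "\n", "The cell below verifies this against the sklearn fit from the demonstration above; it introduces no new data." ] }, { "cell_type": "code", "execution_count": null, "id": "closed-form-demo", "metadata": {}, "outputs": [], "source": [ "# Closed-form least squares for one predictor, checked against the sklearn\n", "# model fitted in the demonstration above (same study-hours data).\n", "slope_cf = np.cov(study_hours, exam_score, ddof=1)[0, 1] / np.var(study_hours, ddof=1)\n", "intercept_cf = exam_score.mean() - slope_cf * study_hours.mean()\n", "\n", "print(f'closed form: score = {intercept_cf:.2f} + {slope_cf:.2f} × hours')\n", "print(f'sklearn:     score = {model.intercept_:.2f} + {model.coef_[0]:.2f} × hours')"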
] }, { "cell_type": "code", "execution_count": null, "id": "activity3-1-data", "metadata": {}, "outputs": [], "source": [ "# Generate temperature and sales data\n", "np.random.seed(123)\n", "temperature = np.random.uniform(15, 35, 100)\n", "sales = -2 + 0.8*temperature + np.random.randn(100)*1.5\n", "\n", "print(f\"Temperature range: {temperature.min():.1f}°C to {temperature.max():.1f}°C\")\n", "print(f\"Sales range: ${sales.min():.1f}k to ${sales.max():.1f}k\")" ] }, { "cell_type": "code", "execution_count": null, "id": "activity3-1", "metadata": {}, "outputs": [], "source": [ "# TODO: Prepare the data for sklearn (remember X must be 2D)\n", "X = None # YOUR CODE - use .reshape(-1, 1)\n", "y = None # YOUR CODE\n", "\n", "# TODO: Create a LinearRegression model and fit it\n", "model = None # YOUR CODE - create the model\n", "# YOUR CODE - fit the model to X and y\n", "\n", "# TODO: Extract the intercept and slope\n", "intercept = None # YOUR CODE\n", "slope = None # YOUR CODE\n", "\n", "# TODO: Print the equation\n", "print(f\"Equation: Sales = ??? + ??? × Temperature\")\n", "\n", "# TODO: Interpret the slope in plain English\n", "print(f\"\\nInterpretation:\")\n", "print(f\"Each 1°C increase in temperature increases sales by $___k\")\n", "\n", "# TODO: Make predictions for the training data\n", "y_pred = None # YOUR CODE\n", "\n", "# TODO: Predict sales when temperature is 25°C\n", "temp_25 = None # YOUR CODE - create array with shape (1, 1)\n", "predicted_sales_25 = None # YOUR CODE\n", "print(f\"\\nPredicted sales at 25°C: ${predicted_sales_25[0]:.1f}k\")" ] }, { "cell_type": "code", "execution_count": null, "id": "activity3-1-viz", "metadata": {}, "outputs": [], "source": [ "# TODO: Create a visualization with scatter plot + fitted line\n", "plt.figure(figsize=(10, 6))\n", "\n", "# YOUR CODE - scatter plot of temperature vs sales\n", "# YOUR CODE - plot the fitted line (temperature vs y_pred)\n", "# YOUR CODE - add labels, title, legend\n", "\n", "plt.grid(alpha=0.3)\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "activity3-2-header", "metadata": {}, "source": [ "### Activity 3.2: Understanding Residuals\n", "\n", "**Residuals** = Actual values - Predicted values\n", "\n", "They tell us what the model **doesn't** explain. Let's investigate them!" ] }, { "cell_type": "code", "execution_count": null, "id": "activity3-2", "metadata": {}, "outputs": [], "source": [ "# Continue with the temperature/sales data from above\n", "\n", "# TODO: Calculate predictions for all data points (if not already done)\n", "y_pred = None # YOUR CODE (or use from above)\n", "\n", "# TODO: Calculate residuals (actual - predicted)\n", "residuals = None # YOUR CODE\n", "\n", "# TODO: Calculate summary statistics of residuals\n", "mean_residual = None # YOUR CODE\n", "std_residual = None # YOUR CODE\n", "max_abs_residual = None # YOUR CODE - use np.abs() then .max()\n", "\n", "print(\"Residual Statistics:\")\n", "print(f\" Mean: {mean_residual:.3f}\") \n", "print(f\" Std: {std_residual:.3f}\")\n", "print(f\" Max absolute error: {max_abs_residual:.3f}\")\n", "\n", "# Question: Why should the mean residual be close to 0?" 
] }, { "cell_type": "code", "execution_count": null, "id": "activity3-2-plot", "metadata": {}, "outputs": [], "source": [ "# TODO: Create a residual plot\n", "# X-axis: predicted values (y_pred)\n", "# Y-axis: residuals\n", "\n", "plt.figure(figsize=(10, 6))\n", "\n", "# YOUR CODE - scatter plot of residuals vs predictions\n", "# YOUR CODE - add horizontal line at y=0 (use plt.axhline)\n", "# YOUR CODE - labels and title\n", "\n", "plt.grid(alpha=0.3)\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "activity3-2-worst", "metadata": {}, "outputs": [], "source": [ "# TODO: Find the 3 worst predictions (largest absolute residuals)\n", "abs_residuals = None # YOUR CODE\n", "worst_indices = None # YOUR CODE - use np.argsort()[-3:][::-1] to get top 3\n", "\n", "print(\"\\nThree worst predictions:\")\n", "for idx in worst_indices:\n", " print(f\" Temp={temperature[idx]:.1f}°C: \"\n", " f\"Actual=${sales[idx]:.1f}k, \"\n", " f\"Predicted=${y_pred[idx]:.1f}k, \"\n", " f\"Error=${residuals[idx]:.1f}k\")\n", "\n", "# Reflection: What might cause large residuals?" ] }, { "cell_type": "markdown", "id": "part4-header", "metadata": {}, "source": [ "---\n", "## Part 4: Moving to 3D - Multiple Predictors\n", "\n", "What happens when we have **multiple** input variables? Can we still use linear models?\n", "\n", "### The Extended Model\n", "**Y = β₀ + β₁ × X₁ + β₂ × X₂ + error**\n", "\n", "Now we have multiple slopes - one for each predictor!" ] }, { "cell_type": "markdown", "id": "part4-data-header", "metadata": {}, "source": [ "### Generate 3D Data: Student Test Scores" ] }, { "cell_type": "code", "execution_count": null, "id": "part4-data", "metadata": {}, "outputs": [], "source": [ "# Scenario: Predict final exam from two midterms\n", "np.random.seed(42)\n", "n_students = 120\n", "\n", "# Generate correlated test scores\n", "midterm1 = np.random.normal(75, 10, n_students)\n", "midterm2 = 0.6*midterm1 + np.random.normal(70, 8, n_students)\n", "final = 0.4*midterm1 + 0.5*midterm2 + np.random.normal(10, 5, n_students)\n", "\n", "# Clip to valid range\n", "midterm1 = np.clip(midterm1, 0, 100)\n", "midterm2 = np.clip(midterm2, 0, 100)\n", "final = np.clip(final, 0, 100)\n", "\n", "print(f\"Generated scores for {n_students} students\")\n", "print(f\"\\nMidterm 1: mean={midterm1.mean():.1f}, std={midterm1.std():.1f}\")\n", "print(f\"Midterm 2: mean={midterm2.mean():.1f}, std={midterm2.std():.1f}\")\n", "print(f\"Final: mean={final.mean():.1f}, std={final.std():.1f}\")" ] }, { "cell_type": "markdown", "id": "part4-viz-header", "metadata": {}, "source": [ "### Visualizing 3D Data" ] }, { "cell_type": "code", "execution_count": null, "id": "part4-pairplot", "metadata": {}, "outputs": [], "source": [ "# Create a pairplot to see all relationships\n", "df = pd.DataFrame({\n", " 'Midterm 1': midterm1,\n", " 'Midterm 2': midterm2,\n", " 'Final': final\n", "})\n", "\n", "sns.pairplot(df, diag_kind='kde', plot_kws={'alpha':0.6, 'edgecolor':'black', 's':40})\n", "plt.suptitle('Pairwise Relationships Between Test Scores', y=1.02, fontsize=13, fontweight='bold')\n", "plt.show()\n", "\n", "print(\"\\nNotice: With 3 variables, we need to look at 3 different scatter plots!\")\n", "print(\"The visualization is already getting complicated...\")" ] }, { "cell_type": "code", "execution_count": null, "id": "part4-corr", "metadata": {}, "outputs": [], "source": [ "# Correlation matrix\n", "corr_matrix = df.corr()\n", "\n", "plt.figure(figsize=(8, 6))\n", "sns.heatmap(corr_matrix, 
annot=True, cmap='coolwarm', center=0, \n", " vmin=-1, vmax=1, square=True, cbar_kws={'label': 'Correlation'},\n", " fmt='.3f', linewidths=1)\n", "plt.title('Correlation Matrix', fontsize=13, fontweight='bold')\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"\\nWith 3 variables, we track 3 correlations:\")\n", "print(f\" Midterm1 ↔ Midterm2: {corr_matrix.loc['Midterm 1', 'Midterm 2']:.3f}\")\n", "print(f\" Midterm1 ↔ Final: {corr_matrix.loc['Midterm 1', 'Final']:.3f}\")\n", "print(f\" Midterm2 ↔ Final: {corr_matrix.loc['Midterm 2', 'Final']:.3f}\")\n", "print(\"\\nBut which predictor matters more? Hard to say from correlations alone!\")" ] }, { "cell_type": "code", "execution_count": null, "id": "part4-3d", "metadata": {}, "outputs": [], "source": [ "# Optional: 3D scatter plot (hard to interpret!)\n", "X_3d = np.column_stack([midterm1, midterm2])\n", "plot_3d_scatter(X_3d, final, feature_names=['Midterm 1', 'Midterm 2', 'Final'])\n", "\n", "print(\"3D plots are hard to read, especially on a 2D screen!\")\n", "print(\"This is where the MODEL really shines...\")" ] }, { "cell_type": "markdown", "id": "activity4-1-header", "metadata": {}, "source": [ "### Activity 4.1: Multiple Regression\n", "\n", "Fit a model with **both** midterms as predictors: **Final = β₀ + β₁ × Midterm1 + β₂ × Midterm2 + error**" ] }, { "cell_type": "code", "execution_count": null, "id": "activity4-1", "metadata": {}, "outputs": [], "source": [ "# TODO: Prepare X matrix with BOTH predictors\n", "# Use np.column_stack([array1, array2]) to combine into 2D array\n", "X = None # YOUR CODE\n", "\n", "# TODO: Prepare y (the final scores)\n", "y = None # YOUR CODE\n", "\n", "# TODO: Create and fit the model\n", "model = None # YOUR CODE\n", "# YOUR CODE - fit the model\n", "\n", "# TODO: Extract coefficients\n", "intercept = None # YOUR CODE\n", "coef_midterm1 = None # YOUR CODE - model.coef_[0]\n", "coef_midterm2 = None # YOUR CODE - model.coef_[1]\n", "\n", "# TODO: Print the equation\n", "print(\"Multiple Regression Equation:\")\n", "print(f\"Final = ??? + ??? × Midterm1 + ??? × Midterm2\")\n", "\n", "# TODO: Interpret each coefficient\n", "print(\"\\nInterpretations:\")\n", "print(f\" • Midterm1 coefficient ({coef_midterm1:.2f}):\")\n", "print(f\" Holding Midterm2 constant, 1 point increase in Midterm1\")\n", "print(f\" → ??? point change in Final\")\n", "print(f\" • Midterm2 coefficient ({coef_midterm2:.2f}):\")\n", "print(f\" Holding Midterm1 constant, 1 point increase in Midterm2\")\n", "print(f\" → ??? 
point change in Final\")" ] }, { "cell_type": "code", "execution_count": null, "id": "activity4-1-pred", "metadata": {}, "outputs": [], "source": [ "# TODO: Make predictions for specific students\n", "# Student A: Midterm1=80, Midterm2=75\n", "# Student B: Midterm1=90, Midterm2=65\n", "\n", "student_A = None # YOUR CODE - array with shape (1, 2)\n", "student_B = None # YOUR CODE - array with shape (1, 2)\n", "\n", "pred_A = None # YOUR CODE\n", "pred_B = None # YOUR CODE\n", "\n", "print(f\"\\nPredictions:\")\n", "print(f\" Student A (Midterm1=80, Midterm2=75) → Final = {pred_A[0]:.1f}\")\n", "print(f\" Student B (Midterm1=90, Midterm2=65) → Final = {pred_B[0]:.1f}\")\n", "print(\"\\nNotice: Even though B scored higher on Midterm1,\")\n", "print(\"the predictions might be similar due to different Midterm2 scores!\")" ] }, { "cell_type": "code", "execution_count": null, "id": "activity4-1-r2", "metadata": {}, "outputs": [], "source": [ "# TODO: Calculate R² (goodness of fit)\n", "y_pred = None # YOUR CODE - predictions for all students\n", "r2 = None # YOUR CODE - use r2_score(y_true, y_pred)\n", "\n", "print(f\"\\nModel Performance:\")\n", "print(f\" R² = {r2:.3f}\")\n", "print(f\" The model explains {r2*100:.1f}% of variance in final scores\")\n", "print(f\" {(1-r2)*100:.1f}% of variance remains unexplained (residuals)\")" ] }, { "cell_type": "markdown", "id": "activity4-2-header", "metadata": {}, "source": [ "### Activity 4.2: Which Variable Matters More?\n", "\n", "Compare three models:\n", "- Model A: Final ~ Midterm1 only\n", "- Model B: Final ~ Midterm2 only\n", "- Model C: Final ~ Midterm1 + Midterm2 (from above)" ] }, { "cell_type": "code", "execution_count": null, "id": "activity4-2", "metadata": {}, "outputs": [], "source": [ "# TODO: Model A - Midterm1 only\n", "X_A = None # YOUR CODE - midterm1 reshaped to 2D\n", "model_A = None # YOUR CODE - create and fit\n", "y_pred_A = None # YOUR CODE\n", "r2_A = None # YOUR CODE\n", "\n", "# TODO: Model B - Midterm2 only \n", "X_B = None # YOUR CODE - midterm2 reshaped to 2D\n", "model_B = None # YOUR CODE - create and fit\n", "y_pred_B = None # YOUR CODE\n", "r2_B = None # YOUR CODE\n", "\n", "# Model C - Both predictors (already computed above)\n", "r2_C = r2 # From previous activity\n", "\n", "# TODO: Print comparison\n", "print(\"=\"*60)\n", "print(\"MODEL COMPARISON\")\n", "print(\"=\"*60)\n", "print(f\"Model A (Midterm1 only): R² = {r2_A:.3f} ({r2_A*100:.1f}% variance)\")\n", "print(f\"Model B (Midterm2 only): R² = {r2_B:.3f} ({r2_B*100:.1f}% variance)\")\n", "print(f\"Model C (Both predictors): R² = {r2_C:.3f} ({r2_C*100:.1f}% variance)\")\n", "\n", "# TODO: Calculate improvements\n", "improvement_over_A = None # YOUR CODE - r2_C - r2_A\n", "improvement_over_B = None # YOUR CODE - r2_C - r2_B\n", "\n", "print(f\"\\nImprovements:\")\n", "print(f\" Adding Midterm2 to Midterm1 improves R² by {improvement_over_A:.3f}\")\n", "print(f\" Adding Midterm1 to Midterm2 improves R² by {improvement_over_B:.3f}\")\n", "\n", "# Reflection questions:\n", "# - Which single predictor is better?\n", "# - Does using both predictors help?\n", "# - Is the improvement substantial?" ] }, { "cell_type": "markdown", "id": "part5-header", "metadata": {}, "source": [ "---\n", "## Part 5: Reality Check - When Linear Models Fail\n", "\n", "Linear models are powerful, but they're not always appropriate. Let's see when they fail and how to detect it." 
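] }, { "cell_type": "markdown", "id": "r2-formula-note", "metadata": {}, "source": [ "Since R² is one of our main diagnostics in this part, it helps to see what `r2_score` actually computes:\n", "\n", "**R² = 1 - SS_res / SS_tot**\n", "\n", "where SS_res is the model's sum of squared residuals and SS_tot is the squared error of always predicting the mean. The cell below is a sketch (toy data, separate from this part's datasets) that checks the formula against `r2_score`." ] }, { "cell_type": "code", "execution_count": null, "id": "r2-formula-demo", "metadata": {}, "outputs": [], "source": [ "# R² computed from its definition, cross-checked against sklearn's r2_score.\n", "# Made-up toy data, used only for this check.\n", "rng = np.random.default_rng(2)\n", "x_toy = rng.uniform(0, 10, 60)\n", "y_toy = 1 + 2 * x_toy + rng.normal(0, 2, 60)\n", "\n", "m_toy = LinearRegression().fit(x_toy.reshape(-1, 1), y_toy)\n", "pred_toy = m_toy.predict(x_toy.reshape(-1, 1))\n", "\n", "ss_res = np.sum((y_toy - pred_toy) ** 2)       # error left over after the fit\n", "ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)   # error of always predicting the mean\n", "\n", "print(f'manual R²: {1 - ss_res / ss_tot:.6f}')\n", "print(f'r2_score:  {r2_score(y_toy, pred_toy):.6f}')"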
] }, { "cell_type": "markdown", "id": "part5-data-header", "metadata": {}, "source": [ "### Generate Three Different Datasets" ] }, { "cell_type": "code", "execution_count": null, "id": "part5-data", "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "n = 100\n", "\n", "# Dataset 1: Good linear fit\n", "x1 = np.random.uniform(0, 10, n)\n", "y1 = 2 + 3*x1 + np.random.randn(n)*2\n", "\n", "# Dataset 2: Moderate fit (noisy)\n", "x2 = np.random.uniform(0, 10, n)\n", "y2 = 5 + 1.5*x2 + np.random.randn(n)*5\n", "\n", "# Dataset 3: Non-linear (quadratic) - BAD for linear model!\n", "x3 = np.linspace(-3, 3, n)\n", "y3 = x3**2 - 2*x3 + 1 + np.random.randn(n)*0.8\n", "\n", "print(\"Generated 3 datasets with different characteristics\")" ] }, { "cell_type": "markdown", "id": "activity5-header", "metadata": {}, "source": [ "### Activity 5: Diagnosing Model Fit\n", "\n", "For each dataset:\n", "1. Fit a linear model\n", "2. Calculate R²\n", "3. Create scatter plot + fitted line\n", "4. Create residual plot\n", "5. Judge if the linear model is appropriate" ] }, { "cell_type": "code", "execution_count": null, "id": "activity5", "metadata": {}, "outputs": [], "source": [ "datasets = [\n", " (x1, y1, 'Dataset 1: Good Fit'),\n", " (x2, y2, 'Dataset 2: Noisy Fit'),\n", " (x3, y3, 'Dataset 3: Non-linear')\n", "]\n", "\n", "results = []\n", "\n", "for x, y, name in datasets:\n", " print(f\"\\n{'='*60}\")\n", " print(f\"Analyzing: {name}\")\n", " print('='*60)\n", " \n", " # TODO: Prepare data\n", " X = None # YOUR CODE\n", " \n", " # TODO: Fit linear model\n", " model = None # YOUR CODE\n", " # YOUR CODE - fit the model\n", " \n", " # TODO: Get predictions and calculate R²\n", " y_pred = None # YOUR CODE\n", " r2 = None # YOUR CODE\n", " \n", " # TODO: Calculate residuals\n", " residuals = None # YOUR CODE\n", " \n", " print(f\"R² = {r2:.3f} ({r2*100:.1f}% of variance explained)\")\n", " \n", " # TODO: Create side-by-side plots\n", " fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))\n", " \n", " # Left plot: Scatter + fitted line\n", " # YOUR CODE - scatter plot on ax1\n", " # YOUR CODE - plot fitted line on ax1\n", " # YOUR CODE - labels and title for ax1\n", " ax1.grid(alpha=0.3)\n", " \n", " # Right plot: Residuals\n", " # YOUR CODE - scatter plot of residuals vs predictions on ax2\n", " # YOUR CODE - add horizontal line at y=0 on ax2\n", " # YOUR CODE - labels and title for ax2\n", " ax2.grid(alpha=0.3)\n", " \n", " plt.suptitle(f'{name} Analysis', fontsize=13, fontweight='bold')\n", " plt.tight_layout()\n", " plt.show()\n", " \n", " # TODO: Describe the residual pattern you see\n", " residual_pattern = \"???\" # YOUR DESCRIPTION\n", " \n", " results.append({\n", " 'name': name,\n", " 'r2': r2,\n", " 'residual_pattern': residual_pattern\n", " })" ] }, { "cell_type": "code", "execution_count": null, "id": "activity5-summary", "metadata": {}, "outputs": [], "source": [ "# TODO: Final assessment for each dataset\n", "print(\"\\n\" + \"=\"*60)\n", "print(\"SUMMARY: Is Linear Model Appropriate?\")\n", "print(\"=\"*60)\n", "\n", "for result in results:\n", " print(f\"\\n{result['name']}:\")\n", " print(f\" R² = {result['r2']:.3f}\")\n", " print(f\" Residual pattern: {result['residual_pattern']}\")\n", " print(f\" Appropriate? 
YOUR ANSWER HERE\")\n", " print(f\" Reason: YOUR EXPLANATION HERE\")\n", "\n", "print(\"\\n\" + \"=\"*60)\n", "print(\"KEY LESSON\")\n", "print(\"=\"*60)\n", "print(\"Linear models are ATTEMPTS to explain data.\")\n", "print(\"They don't always work!\")\n", "print(\"\\nAlways check:\")\n", "print(\" 1. R² - How much variance is explained?\")\n", "print(\" 2. Residual plots - Are there patterns?\")\n", "print(\" 3. Domain knowledge - Does the model make sense?\")" ] }, { "cell_type": "markdown", "id": "part6-header", "metadata": {}, "source": [ "---\n", "## Part 6: Wrap-Up and Looking Ahead" ] }, { "cell_type": "markdown", "id": "part6-4d", "metadata": {}, "source": [ "### Quick Glimpse: 4D and Beyond\n", "\n", "What about 4, 5, or even more dimensions? Models handle them easily!" ] }, { "cell_type": "code", "execution_count": null, "id": "part6-4d-demo", "metadata": {}, "outputs": [], "source": [ "# Generate 4D data\n", "np.random.seed(42)\n", "n = 150\n", "\n", "X1 = np.random.randn(n)\n", "X2 = np.random.randn(n)\n", "X3 = np.random.randn(n)\n", "X4 = np.random.randn(n)\n", "Y = 2 + 0.5*X1 + 1.2*X2 - 0.8*X3 + 0.3*X4 + np.random.randn(n)*0.5\n", "\n", "# Fit model\n", "X_4d = np.column_stack([X1, X2, X3, X4])\n", "model_4d = LinearRegression().fit(X_4d, Y)\n", "\n", "print(\"=\"*60)\n", "print(\"4D MODEL: Y ~ X1 + X2 + X3 + X4\")\n", "print(\"=\"*60)\n", "print(f\"\\nEquation: Y = {model_4d.intercept_:.2f}\", end=\"\")\n", "for i, coef in enumerate(model_4d.coef_, 1):\n", " sign = '+' if coef >= 0 else ''\n", " print(f\" {sign}{coef:.2f}×X{i}\", end=\"\")\n", "print()\n", "\n", "r2_4d = r2_score(Y, model_4d.predict(X_4d))\n", "print(f\"\\nR² = {r2_4d:.3f}\")\n", "print(f\"\\nWe can't visualize 4D relationships,\")\n", "print(f\"but the model handles it effortlessly!\")\n", "print(f\"\\nThis is the power of models: vocabulary that SCALES.\")" ] }, { "cell_type": "markdown", "id": "takeaways", "metadata": {}, "source": [ "### Key Takeaways\n", "\n", "```\n", "1. ONE-DIMENSIONAL DATA\n", " • Fully described by eCDF + summary statistics\n", " • We have complete vocabulary\n", "\n", "2. MULTI-DIMENSIONAL DATA \n", " • Can't be fully described with simple summaries\n", " • Visualization becomes difficult beyond 3D\n", " • Need models to capture relationships\n", "\n", "3. LINEAR MODELS\n", " • Give us equations: Y = β₀ + β₁X₁ + β₂X₂ + ... + error\n", " • Coefficients quantify relationships\n", " • Scale easily to many dimensions\n", "\n", "4. WORKING WITH MODELS\n", " • Residuals = what the model doesn't explain\n", " • R² = proportion of variance explained (0 to 1)\n", " • Always check if the model is appropriate!\n", "\n", "5. LIMITATIONS\n", " • Linear models assume linear relationships\n", " • Check residual plots for patterns\n", " • High R² doesn't always mean good model\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.11" } }, "nbformat": 4, "nbformat_minor": 5 }