{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "3bd64325-118c-40a9-b01c-a3e8e012def0", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import pandas as pd" ] }, { "cell_type": "markdown", "id": "f978a19c-b6d0-43f5-9cc7-2762d1b0d548", "metadata": {}, "source": [ "# Loading the scikit-learn digits dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "24e7e0d0-6b3b-474e-bf3e-9779449ec92e", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "\n", "# Each row of X_all is an 8x8 digit image flattened to 64 features\n", "X_all, y_all = load_digits(return_X_y=True)\n", "\n", "print(f\"The number of digits in the dataset: {X_all.shape[0]}\")\n", "print(\"The class abundances:\")\n", "for i in range(10):\n", "    print(f\"{i}: {np.sum(y_all == i)} digits\")" ] }, { "cell_type": "code", "execution_count": null, "id": "c7fc81ba-8dd0-485e-beb8-86ba789c260c", "metadata": {}, "outputs": [], "source": [ "def plot_a_digit(x):\n", "    # Pixel intensities in load_digits are integers from 0 to 16\n", "    fig, ax = plt.subplots(figsize=(5, 5))\n", "    sns.heatmap(x.reshape(8, 8), vmin=0, vmax=16, cmap='viridis', cbar=False, ax=ax)\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "45b6998e-66c3-4823-8f53-3fe3883f520c", "metadata": {}, "outputs": [], "source": [ "index = 19\n", "plot_a_digit(X_all[index])\n", "print(f\"True class: {y_all[index]}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "97148c6e-a52a-4e3b-ae69-fe2de74f000d", "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Hold out 10% of the data for testing; the fixed seed keeps the split reproducible\n", "X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.1, random_state=42)\n", "\n", "print(f\"The train-test split is {X_train.shape[0]} to {X_test.shape[0]} datapoints.\")" ] }, { "cell_type": "markdown", "id": "59c76aa3-957c-4b03-bd4c-dccd1559db74", "metadata": {}, "source": [ "# Logistic regression between two classes" ] }, { "cell_type": "markdown", "id": "3232b77c-76e3-4df2-9ad2-bcfc83f4e3ed", "metadata": {}, "source": [ "The 
logistic regression class:" ] }, { "cell_type": "code", "execution_count": null, "id": "160fb94d-d5d2-4a37-8802-9dcb60b6fd05", "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "markdown", "id": "4ff4a6e5-f8a5-4b8a-bef7-d034d2ef62ed", "metadata": {}, "source": [ "TODO:\n", "1. Train a logistic regression to separate `class_1 = 2` from `class_2 = 5`, using only those two classes from the train split.\n", "2. Calculate the ROC AUC on the test split using `sklearn.metrics.roc_auc_score` (or `sklearn.metrics.roc_curve` followed by `sklearn.metrics.auc`).\n", "3. Plot the ROC curve using `sklearn.metrics.RocCurveDisplay.from_estimator`.\n", "4. Visualize an example of a misclassified 2 and of a misclassified 5 (if the test split contains any)." ] }, { "cell_type": "markdown", "id": "5bb4a7c3-ffe1-4653-aaca-920af1427638", "metadata": {}, "source": [ "TODO:\n", "1. Wrap the code above in a function accepting `class_1` and `class_2` variables.\n", "2. Test the logistic regression performance for 3's and 2's.\n", "3. Test the logistic regression performance for 1's and 7's.\n", "4. Test the logistic regression performance for 0's and 9's." ] }, { "cell_type": "markdown", "id": "a1f91730-3d38-4802-af93-cb54d15ef874", "metadata": {}, "source": [ "# Logistic regression for multiple classes" ] }, { "cell_type": "markdown", "id": "121e614d-04c1-40c5-bf26-c3ea1f772b73", "metadata": {}, "source": [ "TODO:\n", "1. Using the same `LogisticRegression` class, train a logistic regression on all digit classes.\n", "2. Calculate and visualize the confusion matrix using `sklearn.metrics.confusion_matrix`.\n", "3. Calculate the F1 score of the regression using `sklearn.metrics.f1_score`; with more than two classes it needs an explicit `average` setting such as `average='macro'`.\n", "4. Visualize an example of a misclassified digit for each of the five largest off-diagonal confusion matrix entries." 
] }, { "cell_type": "markdown", "id": "c2865922-00db-4887-8fcf-532ef71d4594", "metadata": {}, "source": [ "# k-means" ] }, { "cell_type": "code", "execution_count": null, "id": "1a6034b0-bf7b-46bd-aa3f-db3b6360d869", "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans" ] }, { "cell_type": "markdown", "id": "39bc5bdd-7931-47b5-9905-7a218ec96032", "metadata": {}, "source": [ "TODO:\n", "1. Cluster the train split into 10 clusters.\n", "2. Calculate and visualize the \"confusion matrix\" between the cluster labels and data labels.\n", "3. Label each cluster with the most common digit among its training points, then use the centroids to classify the test split by assigning each test point the label of its nearest centroid.\n", "4. Calculate the F1 score of the resulting classification." ] }, { "cell_type": "markdown", "id": "50b6ee5f-14d8-4e84-b97f-5887f0edca8b", "metadata": {}, "source": [ "# kNN" ] }, { "cell_type": "markdown", "id": "8659ca0e-17d4-44ec-9b4a-e2fbd01fad72", "metadata": {}, "source": [ "The k-nearest neighbors classifier class:" ] }, { "cell_type": "code", "execution_count": null, "id": "2efd3f56-bbe8-402b-b7bb-697a115a6ceb", "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "markdown", "id": "0a93b2ee-ea69-4135-9dac-70311badee71", "metadata": {}, "source": [ "TODO:\n", "1. Fit the nearest neighbors classifier with `k=5` (the `n_neighbors` parameter) on the train split and evaluate it on the test split.\n", "2. Calculate and visualize the confusion matrix.\n", "3. Calculate the F1 score.\n", "4. Compare the confusion matrices and the scores for `k=2`, `k=5`, `k=10` and `k=20`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.11" } }, "nbformat": 4, "nbformat_minor": 5 }