{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Práctica Regularización" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Objetivo: predecir la tasa de violencia criminal para una comunidad basada en datos socioeconómicos y legales" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Cargamos librerías y el set de datos" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...118119120121122123124125126127
08NaNNaNLakewoodcity10.190.330.020.900.12...0.120.260.200.060.040.90.50.320.140.20
153NaNNaNTukwilacity10.000.160.120.740.45...0.020.120.45NaNNaNNaNNaN0.00NaN0.67
224NaNNaNAberdeentown10.000.420.490.560.17...0.010.210.02NaNNaNNaNNaN0.00NaN0.43
3345.081440.0Willingborotownship10.040.771.000.080.12...0.020.390.28NaNNaNNaNNaN0.00NaN0.12
44295.06096.0Bethlehemtownship10.010.550.020.950.09...0.040.090.02NaNNaNNaNNaN0.00NaN0.03
\n", "

5 rows × 128 columns

\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 \\\n", "0 8 NaN NaN Lakewoodcity 1 0.19 0.33 0.02 0.90 0.12 \n", "1 53 NaN NaN Tukwilacity 1 0.00 0.16 0.12 0.74 0.45 \n", "2 24 NaN NaN Aberdeentown 1 0.00 0.42 0.49 0.56 0.17 \n", "3 34 5.0 81440.0 Willingborotownship 1 0.04 0.77 1.00 0.08 0.12 \n", "4 42 95.0 6096.0 Bethlehemtownship 1 0.01 0.55 0.02 0.95 0.09 \n", "\n", " ... 118 119 120 121 122 123 124 125 126 127 \n", "0 ... 0.12 0.26 0.20 0.06 0.04 0.9 0.5 0.32 0.14 0.20 \n", "1 ... 0.02 0.12 0.45 NaN NaN NaN NaN 0.00 NaN 0.67 \n", "2 ... 0.01 0.21 0.02 NaN NaN NaN NaN 0.00 NaN 0.43 \n", "3 ... 0.02 0.39 0.28 NaN NaN NaN NaN 0.00 NaN 0.12 \n", "4 ... 0.04 0.09 0.02 NaN NaN NaN NaN 0.00 NaN 0.03 \n", "\n", "[5 rows x 128 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Importamos las librerías requeridas\n", "import numpy as np \n", "import pandas as pd\n", "import matplotlib.pyplot \n", "%matplotlib inline\n", "import seaborn as sns \n", "import warnings \n", "warnings.filterwarnings('ignore')\n", "\n", "# Cargamos el dataset y mostramos las 5 primeras filas\n", "url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data'\n", "df = pd.read_csv(url, header=None, na_values=['?'])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Corregimos los posibles errores del datasets y limpiamos los valores NaN" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8128" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Comprobamos si hay valores que falten\n", "sum(df.isnull())" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Nos quedamos con los valores que vamos a usar para la regresión (las columnas 0 - 4 sobran)\n", "df.drop([0,1,2,3,4],axis=1,inplace=True)\n", "# Borramos las filas donde hay valores NaN o faltantes\n", "df.dropna(inplace=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Dividimos los datos en un set de entrenamiento y otro de test" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Definimos x e y\n", "X = df.drop(127, axis=1) # x (features) son todas las columnas menos la última\n", "y = df[127] # la variable target es la última columna del dataset\n", "\n", "# Dividimos los datos (80% entrenamiento y 20% test)\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.2, random_state=1)\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "##### Probamos a entrenar un modelo de regresión lineal" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "intersección : 1.2222288458382677\n", "coeficientes : [-4.42589321e+00 7.43581698e-01 -2.37354783e-01 -3.42019325e-01\n", " -1.92952888e-01 1.97882196e-01 -5.69565279e-01 -2.39234862e-01\n", " 2.56346323e-01 -2.53019568e-01 4.97805236e+00 -2.29564054e-01\n", " 1.04231718e+00 -2.00093384e-01 -2.72480752e-01 -1.07373823e+00\n", " 2.77147368e-01 6.96250848e-01 4.66638644e-02 9.41723218e-01\n", " -3.83963043e-01 -4.85962474e-01 -4.58632583e-01 1.11270348e-01\n", " -1.56014481e-01 3.68813946e-01 -1.00864287e-01 -6.72656722e-01\n", " 1.10883889e+00 1.68323233e-01 -1.21950733e+00 -2.53471804e-01\n", " 7.33442742e-02 7.15942305e-01 -1.81081931e-01 9.20512294e-02\n", " 4.17571224e-01 7.06569303e-01 1.03353131e+00 -1.47913599e-01\n", " 8.47063463e-01 -2.34573029e+00 1.37140832e-02 -9.78225533e-01\n", " -1.08738443e+00 9.17727824e-01 1.77832424e-01 -6.84632936e-01\n", " 5.71217036e-01 -2.30589336e-02 2.24301713e-02 -1.17062479e-01\n", " 5.60860209e-01 -1.07789537e+00 1.29415850e-01 6.06418661e-01\n", " -2.07511082e-01 2.04985247e-01 2.04774567e+00 -3.16661876e+00\n", " 5.26837428e-01 2.98608237e-01 1.07977557e+00 -1.59772999e+00\n", " -1.12444974e+00 7.81704511e-01 -7.26444448e-01 -1.35272710e+00\n", " 7.18329702e-01 6.50923828e-01 6.38965178e-02 9.46239738e-02\n", " 7.71051910e-02 1.95203655e+00 -6.24659355e-02 1.35599408e-01\n", " -4.15138137e-02 -2.16481730e-02 -5.94251053e-02 -9.18595491e-01\n", " 9.51278425e-03 2.89278256e-01 -9.14643148e-01 -1.27240170e-01\n", " -9.38085578e-02 1.57745773e+00 -4.43208779e-01 -1.66393874e-01\n", " -1.74667104e-01 -1.25313859e-01 2.72874223e-01 1.37604907e+00\n", " -2.59741018e-01 -4.17853470e-03 -3.60866853e-03 2.88366529e-01\n", " -1.07377771e+00 -4.16844937e+01 -7.05207857e-01 5.14728510e-01\n", " -5.19979911e-02 2.28495008e-01 3.78920697e-02 4.13470554e+01\n", " -8.17249421e-02 6.07900221e-02 1.41255887e-01 2.48275018e-01\n", " 7.69400764e-02 -4.33657951e-01 -2.23909937e-01 -1.21525932e-02\n", " -4.88628128e-03 1.26472299e-01 3.08685177e-02 -1.96525851e-01\n", " 2.09160515e-01 3.86041352e-01 -1.96104696e-01 1.94143987e-02\n", " -4.04459022e-02 -3.50551983e-01]\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "linreg = LinearRegression()\n", "linreg.fit(X_train, y_train)\n", "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)\n", "print (\"intersección : \",linreg.intercept_)\n", "print (\"coeficientes : \",linreg.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Probamos a entrenar un modelo regularizado Ridge" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "intersección : 0.6031478392948995\n", "coeficientes : [-4.19420712e-03 4.87518242e-02 4.93277241e-02 -6.70439837e-02\n", " -2.56807909e-02 3.00891598e-02 -2.26437683e-03 4.88607036e-03\n", " -1.17402526e-02 5.39130920e-02 -1.96111356e-03 -5.04991706e-02\n", " 6.16101766e-02 -1.25202709e-01 -1.11263671e-01 -3.21017929e-01\n", " -5.26803286e-03 1.00386441e-01 -8.51226741e-02 8.87088856e-02\n", " -3.24830032e-02 -8.29007851e-03 -1.63041710e-01 4.96803864e-02\n", " -8.63347638e-03 1.98768114e-01 4.55137112e-02 -1.02017480e-01\n", " 3.25293355e-02 -1.08495117e-01 -1.15823610e-01 -8.16234720e-03\n", " 1.06339638e-01 -1.60554977e-02 -1.30128561e-01 8.00976191e-02\n", " 8.08618327e-02 1.66357401e-02 6.62985387e-02 7.09198590e-02\n", " -4.22803329e-02 9.34179907e-03 8.50660007e-02 -1.41521793e-01\n", " -1.47018239e-01 -5.95817501e-02 -8.07355427e-02 -2.90576174e-02\n", " 7.03720246e-02 -5.68810174e-02 1.42026318e-01 -5.85527884e-02\n", " -4.12406651e-03 -1.14260674e-01 5.92227293e-02 8.67904515e-02\n", " -4.80438851e-03 1.58145107e-03 1.93402940e-02 8.88389068e-03\n", " -2.18674295e-02 4.50228317e-02 9.75777767e-03 -1.78878049e-02\n", " -2.29713411e-02 6.03887041e-02 -1.28109754e-01 3.66687918e-02\n", " 1.55046925e-02 2.81857246e-01 4.51356343e-02 4.20600637e-02\n", " 1.00184986e-02 5.12544746e-02 -1.71356924e-02 1.89383121e-02\n", " 8.69722062e-03 -3.15611513e-03 -2.48044296e-02 -3.18731647e-02\n", " -2.90068757e-02 -1.65971105e-02 -1.45121076e-01 5.69538010e-02\n", " 2.18824004e-02 1.22628804e-01 -5.70569742e-02 -8.91758427e-02\n", " -8.51316019e-02 5.63041486e-02 1.52070055e-01 5.34250327e-02\n", " -7.75421643e-02 6.11291228e-02 9.38270536e-03 7.15600333e-02\n", " -4.37670654e-02 7.22174267e-03 3.04935355e-02 3.01927801e-03\n", " -1.98311633e-02 4.65131886e-02 1.00229495e-01 1.04912203e-02\n", " -1.23538505e-01 -3.47082183e-02 -1.69726120e-02 4.21965675e-02\n", " 7.44432418e-02 -4.76095635e-02 -7.61878839e-05 2.16664985e-02\n", " -1.87098661e-02 5.56638538e-02 -4.28450228e-02 -6.36354062e-02\n", " 1.04047461e-01 8.11140414e-02 -2.58848256e-02 3.72246491e-02\n", " -8.23429025e-02 -6.75471121e-02]\n" ] } ], "source": [ "# alpha=0 is equivalent to linear regression\n", "from sklearn.linear_model import Ridge\n", "# Probamos con alpha(equivalente a lambda) alpha=0.1\n", "ridgereg = Ridge(alpha=0.1, normalize=True)\n", "ridgereg.fit(X_train, y_train)\n", "y_pred = ridgereg.predict(X_test)\n", "print (\"intersección : \",ridgereg.intercept_)\n", "print (\"coeficientes : \",ridgereg.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Vemos el efecto en los coeficientes de la regularización Ridge. Todos se han aproximado a valores entorno al cero." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Probamos a entrenar un modelo regularizado Lasso" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0. 0. 0. -0.17297361 0. 0.\n", " 0. 0. 0. -0. 0. 0.\n", " -0. -0. -0. -0. -0. 0.\n", " -0. -0. -0. -0. -0. -0.\n", " -0. -0. -0. 0. 0. 0.\n", " 0. -0. 0. -0. -0. 0.\n", " 0. -0. 0. 0. 0. 0.\n", " 0. -0. -0.50465657 -0. -0. -0.\n", " -0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. 0. 0.\n", " -0. 0. 0. 0. 0. 0.\n", " 0. -0. 0. 0. -0. 0.\n", " -0. -0. 0. 0. -0. 0.\n", " 0. -0. -0. -0. -0. -0.\n", " -0. -0. 0. 0. -0. 0.\n", " 0. 0. -0. -0. 0. 0.\n", " 0. 0. -0. 0. 0. 0.\n", " 0. 0. -0. -0. 0. 0.\n", " 0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. 0. -0. 0.\n", " -0. 0. ]\n" ] } ], "source": [ "from sklearn.linear_model import Lasso\n", "lassoreg = Lasso(alpha=0.005, normalize=True)\n", "lassoreg.fit(X_train, y_train)\n", "print(lassoreg.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Con una regularización Lasso, muchos de los coeficientes son directamente cero." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Ahora juega tú a cambiar los valores de alpha y descubre cómo varían los coeficientes. Te animo también a que pruebes Ridge y Lasso con validación cruzada RidgeCV y LassoCV." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }