{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

MANIPULATING TIME DATA ⌛⌛⌛

\n", "\n", "In this notebook we will use the medical appointments dataset hosted on Kaggle (https://www.kaggle.com/joniarroba/noshowappointments). It contains more than 110,000 medical appointments. \n", "\n", "The main column we will work with is ScheduledDay (the date and time at which the appointment was booked). The goal is to see how a time column can be manipulated to fit the needs of a future AI model. " ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>PatientId</th><th>AppointmentID</th><th>Gender</th><th>ScheduledDay</th><th>AppointmentDay</th><th>Age</th><th>Neighbourhood</th><th>Scholarship</th><th>Hipertension</th><th>Diabetes</th><th>Alcoholism</th><th>Handcap</th><th>SMS_received</th><th>No-show</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>2.987250e+13</td><td>5642903</td><td>F</td><td>2016-04-29T18:38:08Z</td><td>2016-04-29T00:00:00Z</td><td>62</td><td>JARDIM DA PENHA</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>No</td></tr>\n",
"<tr><th>1</th><td>5.589978e+14</td><td>5642503</td><td>M</td><td>2016-04-29T16:08:27Z</td><td>2016-04-29T00:00:00Z</td><td>56</td><td>JARDIM DA PENHA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>No</td></tr>\n",
"<tr><th>2</th><td>4.262962e+12</td><td>5642549</td><td>F</td><td>2016-04-29T16:19:04Z</td><td>2016-04-29T00:00:00Z</td><td>62</td><td>MATA DA PRAIA</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>No</td></tr>\n",
"<tr><th>3</th><td>8.679512e+11</td><td>5642828</td><td>F</td><td>2016-04-29T17:29:31Z</td><td>2016-04-29T00:00:00Z</td><td>8</td><td>PONTAL DE CAMBURI</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>No</td></tr>\n",
"<tr><th>4</th><td>8.841186e+12</td><td>5642494</td><td>F</td><td>2016-04-29T16:07:23Z</td><td>2016-04-29T00:00:00Z</td><td>56</td><td>JARDIM DA PENHA</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>No</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " PatientId AppointmentID Gender ScheduledDay \\\n", "0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z \n", "1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z \n", "2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z \n", "3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z \n", "4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z \n", "\n", " AppointmentDay Age Neighbourhood Scholarship Hipertension \\\n", "0 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 \n", "1 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 \n", "2 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 \n", "3 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 \n", "4 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 \n", "\n", " Diabetes Alcoholism Handcap SMS_received No-show \n", "0 0 0 0 0 No \n", "1 0 0 0 0 No \n", "2 0 0 0 0 No \n", "3 0 0 0 0 No \n", "4 1 0 0 0 No " ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Assumes the dataset is saved in the dataset folder ;) otherwise adjust the path\n", "df = pd.read_csv(\"dataset/cita_medica.csv\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2016-04-29T18:38:08Z\n", "1 2016-04-29T16:08:27Z\n", "2 2016-04-29T16:19:04Z\n", "3 2016-04-29T17:29:31Z\n", "4 2016-04-29T16:07:23Z\n", "Name: ScheduledDay, dtype: object" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.ScheduledDay.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the dtype of ScheduledDay is object, which means pandas treats these values as strings. To convert these strings into dates we can use the pandas function to_datetime. \n", "\n", "Passing the format parameter to state the expected format explicitly is usually a good idea, since parsing becomes stricter and can be faster. Whether or not you pass format, decide how parsing errors are handled with the errors parameter: the default, 'raise', throws an exception, while 'coerce' turns unparseable values into NaT.\n", "\n" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'], \n", " format = '%Y-%m-%dT%H:%M:%SZ', \n", " errors = 'coerce')\n", "assert df.ScheduledDay.isnull().sum() == 0, 'missing ScheduledDay dates'" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2016-04-29 18:38:08\n", "1 2016-04-29 16:08:27\n", "2 2016-04-29 16:19:04\n", "3 2016-04-29 17:29:31\n", "4 2016-04-29 16:07:23\n", "Name: ScheduledDay, dtype: datetime64[ns]" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the result: datetime dtype and a more readable format\n", "df.ScheduledDay.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the strings are converted to datetimes, we can use pandas' datetime-like properties through the .dt accessor: https://pandas.pydata.org/pandas-docs/version/0.23/api.html#datetimelike-properties\n", "\n", "Essentially, pandas lets you break a date down into the year, the month, the week of the year, the day of the month, the hour, the minute, the second, and so on. You can also get the day of the week (Monday = 0, Sunday = 6)." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "df['Año_programado'] = df['ScheduledDay'].dt.year\n", "df['Mes_programado'] = df['ScheduledDay'].dt.month\n", "df['Semana_programada'] = df['ScheduledDay'].dt.isocalendar().week\n", "df['Día_programado'] = df['ScheduledDay'].dt.day\n", "df['Hora_programado'] = df['ScheduledDay'].dt.hour\n", "df['Minuto_programado'] = df['ScheduledDay'].dt.minute\n", "df['Día_semana_programado'] = df['ScheduledDay'].dt.dayofweek" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>Año_programado</th><th>Mes_programado</th><th>Semana_programada</th><th>Día_programado</th><th>Hora_programado</th><th>Minuto_programado</th><th>Día_semana_programado</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>2016</td><td>4</td><td>17</td><td>29</td><td>18</td><td>38</td><td>4</td></tr>\n",
"<tr><th>1</th><td>2016</td><td>4</td><td>17</td><td>29</td><td>16</td><td>8</td><td>4</td></tr>\n",
"<tr><th>2</th><td>2016</td><td>4</td><td>17</td><td>29</td><td>16</td><td>19</td><td>4</td></tr>\n",
"<tr><th>3</th><td>2016</td><td>4</td><td>17</td><td>29</td><td>17</td><td>29</td><td>4</td></tr>\n",
"<tr><th>4</th><td>2016</td><td>4</td><td>17</td><td>29</td><td>16</td><td>7</td><td>4</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " Año_programado Mes_programado Semana_programada Día_programado \\\n", "0 2016 4 17 29 \n", "1 2016 4 17 29 \n", "2 2016 4 17 29 \n", "3 2016 4 17 29 \n", "4 2016 4 17 29 \n", "\n", " Hora_programado Minuto_programado Día_semana_programado \n", "0 18 38 4 \n", "1 16 8 4 \n", "2 16 19 4 \n", "3 17 29 4 \n", "4 16 7 4 " ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Año_programado','Mes_programado','Semana_programada','Día_programado','Hora_programado',\n", " 'Minuto_programado','Día_semana_programado']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Congratulations. You have split the information into separate columns; later you will keep only the ones that add value.\n", "If you have made it this far, you are at a very interesting point in your learning. Keep working like this and you will achieve whatever you set out to do.\n", "\n", "Keep going! 💪💪💪" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }