Examples
Information
The examples can be found in .ipynb and .html format at the following link, together with the required datasets and some other .xlsx and .csv files that are generated along the way.
I recommend downloading them to follow how the code runs. The first one is reproduced in this documentation, since it is an illustrative example covering the basics.
Simple example
This notebook shows the autoscorecard class applied to a toy dataset.
In this dataset the target variable is strongly correlated with the rest of the variables.
We import the modules
import numpy as np, pandas as pd, pyken as pyk
We load the dataset, separating the predictor variables from the target variable
from sklearn.datasets import load_breast_cancer as lbc
X, y = pd.DataFrame(lbc().data, columns=lbc().feature_names), lbc().target
We take a look at the dataset
X[X.columns[:10]].head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 |
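As a quick check of the strong correlation with the target mentioned above, we can look at the absolute correlation of each feature with y. This is a minimal sketch using plain pandas, not part of the original notebook:
X.corrwith(pd.Series(y)).abs().sort_values(ascending=False).head()  # |correlation| of each feature with the target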
It is not good practice to have spaces in column names, so we replace them with underscores
X.columns = [i.replace(' ', '_') for i in X.columns]
We apply the autoscorecard class to build the model automatically
modelo1 = pyk.autoscorecard().fit(X, y)
70-30 partition stratified on the target completed
------------------------------------------------------------------------------------------------------------------------------------------------------
Autogrouping completed. Maximum number of buckets = 5. Minimum percentage per bucket = 0.05
------------------------------------------------------------------------------------------------------------------------------------------------------
Step 01 | Time - 0:00:00.359722 | p-value = 4.93e-32 | Gini train = 83.97% | Gini test = 87.30% ---> Feature selected: mean_concavity
Step 02 | Time - 0:00:00.392056 | p-value = 1.38e-14 | Gini train = 96.82% | Gini test = 97.24% ---> Feature selected: worst_perimeter
Step 03 | Time - 0:00:00.405165 | p-value = 4.31e-06 | Gini train = 98.34% | Gini test = 98.07% ---> Feature selected: worst_texture
Step 04 | Time - 0:00:00.577206 | p-value = 5.11e-04 | Gini train = 98.92% | Gini test = 97.06% ---> Feature selected: worst_smoothness
Step 05 | Time - 0:00:00.706235 | p-value = 1.62e-03 | Gini train = 99.34% | Gini test = 98.51% ---> Feature selected: radius_error
Step 05 | Time - 0:00:00.000000 | p-value = 1.54e-02 | Gini train = 99.25% | Gini test = 98.22% ---> Feature deleted : mean_concavity
Step 06 | Time - 0:00:00.691006 | p-value = 2.28e-03 | Gini train = 99.60% | Gini test = 98.77% ---> Feature selected: worst_concavity
------------------------------------------------------------------------------------------------------------------------------------------------------
No remaining variable has a p-value < 0.01, so the process stops
------------------------------------------------------------------------------------------------------------------------------------------------------
Selection completed: ['worst_perimeter', 'worst_texture', 'worst_smoothness', 'radius_error', 'worst_concavity']
------------------------------------------------------------------------------------------------------------------------------------------------------
The model has a KS of 95.55% and a Gini of 99.60% on the development sample
------------------------------------------------------------------------------------------------------------------------------------------------------
The model has a KS of 95.63% and a Gini of 98.77% on the validation sample
------------------------------------------------------------------------------------------------------------------------------------------------------
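The Gini and KS reported above are the usual ranking metrics: Gini = 2*AUC - 1, and KS is the maximum gap between the cumulative score distributions of the two classes. The following minimal sketch computes them with scikit-learn, using a single raw feature as a stand-in score purely to illustrate the formulas; it does not reproduce how pyken evaluates the fitted model internally:
from sklearn.metrics import roc_auc_score, roc_curve
score = X['worst_perimeter']  # stand-in score: a single raw feature, just for illustration
gini = abs(2 * roc_auc_score(y, score) - 1)  # Gini = |2*AUC - 1|; the absolute value ignores the score direction
fpr, tpr, _ = roc_curve(y, score)
ks = np.abs(tpr - fpr).max()  # KS = maximum distance between the two cumulative distributions
print(f'Gini = {gini:.2%} | KS = {ks:.2%}')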
We display the scorecard with color formatting
pyk.pretty_scorecard(modelo1)
| | Variable | Group | Count | Percent | Goods | Bads | Bad rate | WoE | IV | Raw score | Aligned score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | worst_perimeter | (-inf, 91.69) | 167 | 0.419598 | 2 | 165 | 0.988024 | -3.888550 | 2.513895 | 5.210485 | -48 |
| 1 | worst_perimeter | [91.69, 102.05) | 58 | 0.145729 | 5 | 53 | 0.913793 | -1.836605 | 0.327313 | 2.460970 | 32 |
| 2 | worst_perimeter | [102.05, 114.65) | 52 | 0.130653 | 23 | 29 | 0.557692 | 0.292447 | 0.011524 | -0.391866 | 114 |
| 3 | worst_perimeter | [114.65, inf) | 121 | 0.304020 | 118 | 3 | 0.024793 | 4.196321 | 3.295360 | -5.622885 | 265 |
| 4 | worst_texture | (-inf, 23.35) | 159 | 0.399497 | 19 | 140 | 0.880503 | -1.472955 | 0.635759 | 2.334961 | 35 |
| 5 | worst_texture | [23.35, 28.24) | 112 | 0.281407 | 54 | 58 | 0.517857 | 0.452790 | 0.060160 | -0.717772 | 123 |
| 6 | worst_texture | [28.24, 29.23) | 20 | 0.050251 | 5 | 15 | 0.750000 | -0.574364 | 0.015058 | 0.910494 | 76 |
| 7 | worst_texture | [29.23, 31.17) | 31 | 0.077889 | 26 | 5 | 0.161290 | 2.172907 | 0.338269 | -3.444541 | 202 |
| 8 | worst_texture | [31.17, inf) | 76 | 0.190955 | 44 | 32 | 0.421053 | 0.842702 | 0.142667 | -1.335871 | 141 |
| 9 | worst_smoothness | (-inf, 0.10) | 34 | 0.085427 | 1 | 33 | 0.970588 | -2.972259 | 0.372255 | 6.443867 | -83 |
| 10 | worst_smoothness | [0.10, 0.13) | 130 | 0.326633 | 33 | 97 | 0.746154 | -0.553955 | 0.091418 | 1.200976 | 68 |
| 11 | worst_smoothness | [0.13, 0.14) | 63 | 0.158291 | 11 | 52 | 0.825397 | -1.029100 | 0.137566 | 2.231092 | 38 |
| 12 | worst_smoothness | [0.14, 0.16) | 114 | 0.286432 | 60 | 54 | 0.473684 | 0.629609 | 0.119251 | -1.364995 | 142 |
| 13 | worst_smoothness | [0.16, inf) | 57 | 0.143216 | 43 | 14 | 0.245614 | 1.646391 | 0.386146 | -3.569382 | 206 |
| 14 | radius_error | (-inf, 0.24) | 111 | 0.278894 | 6 | 105 | 0.945946 | -2.337952 | 0.887158 | 3.846413 | -8 |
| 15 | radius_error | [0.24, 0.41) | 152 | 0.381910 | 35 | 117 | 0.769737 | -0.682577 | 0.158026 | 1.122980 | 70 |
| 16 | radius_error | [0.41, 0.48) | 32 | 0.080402 | 23 | 9 | 0.281250 | 1.462518 | 0.174633 | -2.406144 | 172 |
| 17 | radius_error | [0.48, 0.56) | 26 | 0.065327 | 12 | 14 | 0.538462 | 0.370098 | 0.009282 | -0.608887 | 120 |
| 18 | radius_error | [0.56, inf) | 77 | 0.193467 | 72 | 5 | 0.064935 | 3.191477 | 1.488781 | -5.250637 | 254 |
| 19 | worst_concavity | (-inf, 0.21) | 182 | 0.457286 | 4 | 178 | 0.978022 | -3.271241 | 2.240711 | 2.492782 | 31 |
| 20 | worst_concavity | [0.21, 0.26) | 39 | 0.097990 | 12 | 27 | 0.692308 | -0.286682 | 0.007717 | 0.218460 | 96 |
| 21 | worst_concavity | [0.26, 0.29) | 20 | 0.050251 | 14 | 6 | 0.300000 | 1.371547 | 0.096824 | -1.045159 | 133 |
| 22 | worst_concavity | [0.29, 0.38) | 53 | 0.133166 | 26 | 27 | 0.509434 | 0.486508 | 0.032925 | -0.370734 | 113 |
| 23 | worst_concavity | [0.38, inf) | 104 | 0.261307 | 92 | 12 | 0.115385 | 2.561131 | 1.469120 | -1.951657 | 159 |
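The WoE and IV columns follow the standard weight-of-evidence definitions: for each bucket, WoE = ln(share of goods / share of bads), and its IV contribution is (share of goods - share of bads) * WoE. The following sketch, not part of the original notebook, reproduces the worst_perimeter rows from the Goods and Bads counts shown in the table:
buckets = pd.DataFrame({'goods': [2, 5, 23, 118], 'bads': [165, 53, 29, 3]})  # worst_perimeter counts from the table
dist_goods = buckets['goods'] / buckets['goods'].sum()  # share of all goods falling in each bucket
dist_bads = buckets['bads'] / buckets['bads'].sum()  # share of all bads falling in each bucket
buckets['woe'] = np.log(dist_goods / dist_bads)  # matches the WoE column above
buckets['iv'] = (dist_goods - dist_bads) * buckets['woe']  # matches the IV column above
buckets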
The model can also export the scorecard to a spreadsheet, rendered with the appropriate formatting.
modelo1.save_excel('scorecard_ejemplo01.xlsx')
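To sanity-check the exported file you can read it back with pandas (this assumes an Excel engine such as openpyxl is installed; the colors and formats are only visible when opening the file in a spreadsheet application):
pd.read_excel('scorecard_ejemplo01.xlsx')  # loads the exported sheet as a plain DataFrame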