Ethik AI

Pima Indians Diabetes Database

In this notebook, we illustrate black-box model explanation on the Pima Indians Diabetes Database, a medical dataset. There are eight features:

  • Number of times pregnant (Pregnancies)
  • Plasma glucose concentration at 2 hours in an oral glucose tolerance test (Glucose)
  • Diastolic blood pressure in mm Hg (BloodPressure)
  • Triceps skin fold thickness in mm (SkinThickness)
  • 2-Hour serum insulin in mu U/ml (Insulin)
  • Body mass index measured as weight in kg/(height in m)^2 (BMI)
  • Diabetes pedigree function (DiabetesPedigreeFunction)
  • Age in years (Age)

The diabetes pedigree function (DiabetesPedigreeFunction) provides some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. This measure of genetic influence gives us an idea of the hereditary risk of developing diabetes mellitus.

In [1]:
import ethik

X, y = ethik.datasets.load_diabetes()
X.head()
Out[1]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
0            6      148             72             35        0  33.6                     0.627   50
1            1       85             66             29        0  26.6                     0.351   31
2            8      183             64              0        0  23.3                     0.672   32
3            1       89             66             23       94  28.1                     0.167   21
4            0      137             40             35      168  43.1                     2.288   33
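
Note that several rows show 0 for Insulin and SkinThickness. In this dataset, zeros in such columns are commonly treated as placeholders for missing measurements, so it is worth counting them before trusting the features. A quick check, independent of ethik:

cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
# A value of 0 is not physiologically plausible for these measurements
print((X[cols] == 0).sum())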
In [2]:
y.head()
Out[2]:
0     True
1    False
2     True
3    False
4     True
Name: has_diabetes, dtype: bool
In [3]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)
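
The split is random. If we wanted the train and test sets to keep the same proportion of positive cases, we could pass stratify=y; either way, a quick check of the class balance is cheap (a sketch):

# Booleans average to the fraction of positives
print(f'Train positives: {y_train.mean():.3f}')
print(f'Test positives:  {y_test.mean():.3f}')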

Multiple classification models have been explored for this dataset elsewhere.

In this notebook, we aim to illustrate explainability, so we will arbitrarily train a gradient-boosting tree using LightGBM.

In [4]:
import lightgbm as lgb
import pandas as pd

model = lgb.LGBMClassifier(random_state=42).fit(X_train, y_train)

y_pred = model.predict_proba(X_test)[:, 1]
# We use a named pandas series to make plot labels more explicit
y_pred = pd.Series(y_pred, name='has_diabetes')
y_pred.head()
Out[4]:
0    0.633970
1    0.017857
2    0.031330
3    0.061403
4    0.518020
Name: has_diabetes, dtype: float64
In [5]:
from sklearn import metrics

# As `y_test` is binary (True or False), we need to binarize `y_pred` as well
# for `metrics.accuracy_score` to work, so we threshold the probabilities at 0.5.
print(f'Accuracy score: {metrics.accuracy_score(y_test, y_pred > 0.5):.4f}')
Accuracy score: 0.7292
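
Since `y_pred` holds probabilities, a threshold-free metric such as ROC AUC is a useful complement to accuracy (a sketch):

# ROC AUC consumes the raw probabilities, so no 0.5 threshold is needed
print(f'ROC AUC: {metrics.roc_auc_score(y_test, y_pred):.4f}')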

Let's plot the four features that influence the predictions the most:

In [6]:
explainer = ethik.ClassificationExplainer()
explainer.plot_influence_ranking(
    X_test=X_test,
    y_pred=y_pred,
    n_features=4,
)
100%|██████████| 328/328 [00:00<00:00, 530.61it/s]

Glucose concentration has the strongest influence on the predicted probability of having diabetes. Let's have a look at the details:

In [7]:
explainer.plot_influence(
    X_test=X_test["Glucose"],
    y_pred=y_pred,
)
100%|██████████| 41/41 [00:00<00:00, 543.20it/s]
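
As a rough, library-independent sanity check of the direction of this effect, we can compare the average prediction for patients below and above the median glucose level (a sketch; this is not how ethik computes influence):

# Use a positional mask: y_pred has a fresh RangeIndex while X_test kept the original one
high_glucose = (X_test['Glucose'] > X_test['Glucose'].median()).to_numpy()
print(f'Mean prediction, high glucose: {y_pred[high_glucose].mean():.3f}')
print(f'Mean prediction, low glucose:  {y_pred[~high_glucose].mean():.3f}')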

Let's compare with other impactful features:

In [8]:
explainer.plot_influence(
    X_test=X_test[["BloodPressure", "BMI", "Age", "Glucose"]],
    y_pred=y_pred,
)
100%|██████████| 164/164 [00:00<00:00, 537.04it/s]

The blood pressure curve looks odd because the optimization procedure did not converge (as indicated by the warning above). This may also explain why the feature ranks as impactful: its influence estimate varies a lot, but for the wrong reasons.
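
One plausible culprit, worth checking (this is our assumption, not something the warning states), is the presence of placeholder zeros in BloodPressure, which stretch the feature's range and can destabilize the optimization:

# A diastolic pressure of 0 mm Hg is a missing-value placeholder, not a real measurement
print(f"Zero BloodPressure rows in X_test: {(X_test['BloodPressure'] == 0).sum()}")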
