Ethik AI

Pima Indians Diabetes Database

In this notebook, we illustrate black-box model explanation on the Pima Indians Diabetes Database, a medical dataset. There are eight features:

  • Number of times pregnant (Pregnancies)
  • Plasma glucose concentration at 2 hours in an oral glucose tolerance test (Glucose)
  • Diastolic blood pressure in mm Hg (BloodPressure)
  • Triceps skin fold thickness in mm (SkinThickness)
  • 2-Hour serum insulin in mu U/ml (Insulin)
  • Body mass index measured as weight in kg/(height in m)^2 (BMI)
  • Diabetes pedigree function (DiabetesPedigreeFunction)
  • Age in years (Age)

The diabetes pedigree function (DiabetesPedigreeFunction) provides some data on diabetes mellitus history in relatives and the genetic relationship of those relatives to the patient. This measure of genetic influence gives us an idea of the hereditary risk of developing diabetes mellitus.

In [1]:
import ethik

X, y = ethik.datasets.load_diabetes()
X.head()
Out[1]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
0            6      148             72             35        0  33.6                     0.627   50
1            1       85             66             29        0  26.6                     0.351   31
2            8      183             64              0        0  23.3                     0.672   32
3            1       89             66             23       94  28.1                     0.167   21
4            0      137             40             35      168  43.1                     2.288   33
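
Note that several rows show 0 for Insulin and SkinThickness. In this dataset, zeros in such columns are commonly treated as placeholders for missing measurements, so it is worth counting them before trusting the features. A quick check, independent of ethik:

cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
# A value of 0 is not physiologically plausible for these measurements
print((X[cols] == 0).sum())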
In [2]:
y.head()
Out[2]:
0     True
1    False
2     True
3    False
4     True
Name: has_diabetes, dtype: bool
In [3]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)
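
The split is random. If we wanted the train and test sets to keep the same proportion of positive cases, we could pass stratify=y; either way, a quick check of the class balance is cheap (a sketch):

# Booleans average to the fraction of positives
print(f'Train positives: {y_train.mean():.3f}')
print(f'Test positives:  {y_test.mean():.3f}')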

Multiple classification models have been explored for this dataset elsewhere.

In this notebook, we aim to illustrate explainability, so we will arbitrarily train a gradient-boosting tree using LightGBM.

In [4]:
import lightgbm as lgb
import pandas as pd

model = lgb.LGBMClassifier(random_state=42).fit(X_train, y_train)

y_pred = model.predict_proba(X_test)[:, 1]
# We use a named pandas series to make plot labels more explicit
y_pred = pd.Series(y_pred, name='has_diabetes')
y_pred.head()
Out[4]:
0    0.633970
1    0.017857
2    0.031330
3    0.061403
4    0.518020
Name: has_diabetes, dtype: float64
In [5]:
from sklearn import metrics

# As `y_test` is binary (True or False), we need to binarize `y_pred` as well
# for `metrics.accuracy_score` to work, so we threshold the probabilities at 0.5.
print(f'Accuracy score: {metrics.accuracy_score(y_test, y_pred > 0.5):.4f}')
Accuracy score: 0.7292
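
Since `y_pred` holds probabilities, a threshold-free metric such as ROC AUC is a useful complement to accuracy (a sketch):

# ROC AUC consumes the raw probabilities, so no 0.5 threshold is needed
print(f'ROC AUC: {metrics.roc_auc_score(y_test, y_pred):.4f}')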

Let's plot the four features that influence the predictions the most:

In [6]:
explainer = ethik.ClassificationExplainer()
explainer.plot_influence_ranking(
    X_test=X_test,
    y_pred=y_pred,
    n_features=4,
)
100%|██████████| 328/328 [00:00<00:00, 530.61it/s]

Glucose concentration has the strongest influence on the predicted probability of having diabetes. Let's have a look at the details:

In [7]:
explainer.plot_influence(
    X_test=X_test["Glucose"],
    y_pred=y_pred,
)
100%|██████████| 41/41 [00:00<00:00, 543.20it/s]
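
As a rough, library-independent sanity check of the direction of this effect, we can compare the average prediction for patients below and above the median glucose level (a sketch; this is not how ethik computes influence):

# Use a positional mask: y_pred has a fresh RangeIndex while X_test kept the original one
high_glucose = (X_test['Glucose'] > X_test['Glucose'].median()).to_numpy()
print(f'Mean prediction, high glucose: {y_pred[high_glucose].mean():.3f}')
print(f'Mean prediction, low glucose:  {y_pred[~high_glucose].mean():.3f}')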

Let's compare with other impactful features:

In [8]:
explainer.plot_influence(
    X_test=X_test[["BloodPressure", "BMI", "Age", "Glucose"]],
    y_pred=y_pred,
)
100%|██████████| 164/164 [00:00<00:00, 537.04it/s]

The blood pressure curve looks odd because the optimization procedure did not converge (as indicated by the warning above). This may also explain why the feature ranks as impactful: its influence estimate varies a lot, but for the wrong reasons.
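
One plausible culprit, worth checking (this is our assumption, not something the warning states), is the presence of placeholder zeros in BloodPressure, which stretch the feature's range and can destabilize the optimization:

# A diastolic pressure of 0 mm Hg is a missing-value placeholder, not a real measurement
print(f"Zero BloodPressure rows in X_test: {(X_test['BloodPressure'] == 0).sum()}")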
