Pima Indians Diabetes Database
In this notebook, we illustrate black-box model explanation with the medical Pima Indians Diabetes Database dataset. There are eight features:
- Number of times pregnant (`Pregnancies`)
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test (`Glucose`)
- Diastolic blood pressure in mm Hg (`BloodPressure`)
- Triceps skin fold thickness in mm (`SkinThickness`)
- 2-hour serum insulin in mu U/ml (`Insulin`)
- Body mass index, weight in kg/(height in m)^2 (`BMI`)
- Diabetes pedigree function (`DiabetesPedigreeFunction`)
- Age in years (`Age`)
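To make the schema concrete, here is a minimal sketch of a frame with the eight columns above. The values are made up for illustration and are not taken from the real dataset:

```python
import pandas as pd

# Illustrative rows only -- the values are invented, not real patient data
X_sketch = pd.DataFrame(
    {
        "Pregnancies": [6, 1],
        "Glucose": [148, 85],
        "BloodPressure": [72, 66],
        "SkinThickness": [35, 29],
        "Insulin": [0, 94],
        "BMI": [33.6, 26.6],
        "DiabetesPedigreeFunction": [0.627, 0.351],
        "Age": [50, 31],
    }
)
print(list(X_sketch.columns))
```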
The diabetes pedigree function (`DiabetesPedigreeFunction`) summarizes the history of diabetes mellitus in the patient's relatives and the genetic closeness of those relatives to the patient. This measure of genetic influence gives us an idea of the hereditary risk one might have of developing diabetes mellitus.
import ethik
X, y = ethik.datasets.load_diabetes()
X.head()
y.head()
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)
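If the classes are imbalanced (common in medical datasets), one might additionally pass `stratify=y` so that both splits keep the same class proportions. A minimal sketch on toy labels, not the real dataset:

```python
import numpy as np
from sklearn import model_selection

# Toy labels with a 3:1 class imbalance (illustrative only)
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_toy, y_toy, shuffle=True, stratify=y_toy, random_state=42
)
# Stratification preserves the ~25% positive rate in both splits
print(y_tr.mean(), y_te.mean())
```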
Many classification models could be used here. Since our goal in this notebook is to illustrate explainability, we arbitrarily train a gradient-boosted tree model using LightGBM.
import lightgbm as lgb
import pandas as pd
model = lgb.LGBMClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict_proba(X_test)[:, 1]
# We use a named pandas series to make plot labels more explicit
y_pred = pd.Series(y_pred, name='has_diabetes')
y_pred.head()
from sklearn import metrics
# As `y_test` is binary (0 or 1), we need to make `y_pred` binary as well
# for `metrics.accuracy_score` to work.
print(f'Accuracy score: {metrics.accuracy_score(y_test, y_pred > 0.5):.4f}')
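Because thresholding at 0.5 discards the probability information, a threshold-free metric such as ROC AUC can complement accuracy. A minimal sketch on toy labels and scores (not the real predictions):

```python
import numpy as np
from sklearn import metrics

# Toy labels and predicted scores (illustrative); AUC needs no 0.5 cutoff
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(f"ROC AUC: {metrics.roc_auc_score(y_true, y_score):.2f}")  # 0.75
```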
Let's plot the four most impactful features on the predictions:
explainer = ethik.ClassificationExplainer()
explainer.plot_influence_ranking(
X_test=X_test,
y_pred=y_pred,
n_features=4,
)
The concentration of glucose is the most impactful feature on the probability of having diabetes. Let's have a look at the details:
explainer.plot_influence(
X_test=X_test["Glucose"],
y_pred=y_pred,
)
Let's compare with other impactful features:
explainer.plot_influence(
X_test=X_test[["BloodPressure", "BMI", "Age", "Glucose"]],
y_pred=y_pred,
)
The blood pressure curve looks odd because the optimization procedure didn't converge (as stated by the warning above). This may also explain why the feature is ranked as impactful: its curve varies a lot, but for the wrong reasons.