Ethik AI

Heart Disease UCI

In this notebook, we illustrate black-box model explanation with the medical Heart Disease UCI dataset. There are thirteen features:

  • age
  • sex
  • cp: chest pain type (4 values)
  • trestbps: resting blood pressure
  • chol: serum cholesterol in mg/dl
  • fbs: fasting blood sugar > 120 mg/dl
  • restecg: resting electrocardiographic results (values 0, 1, 2)
  • thalach: maximum heart rate achieved
  • exang: exercise induced angina
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment
  • ca: number of major vessels (0-3) colored by fluoroscopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

The output is the presence (1) or absence (0) of heart disease.
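With a binary medical target, it is worth checking the class balance before modelling. A minimal sketch (using a small hypothetical series in place of the `y` loaded below):

```python
import pandas as pd

# Hypothetical stand-in for the `y` target series loaded from the dataset
y_demo = pd.Series([True, True, False, True, False], name='has_heart_disease')

# Count how many positive and negative cases the target holds
counts = y_demo.value_counts()
print(counts)
```

On the real series, `y.value_counts()` works the same way and shows whether the two classes are roughly balanced.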

In [1]:
import ethik

X, y = ethik.datasets.load_heart_disease()
X.head()
Out[1]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2
In [2]:
y.head()
Out[2]:
0    True
1    True
2    True
3    True
4    True
Name: has_heart_disease, dtype: bool
In [3]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)
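The split above shuffles but does not stratify. On a small medical dataset, preserving the class ratio across train and test sets can matter; `train_test_split` accepts a `stratify` argument for this. A sketch with synthetic labels (the names `X_demo`/`y_demo` are illustrative):

```python
import numpy as np
from sklearn import model_selection

# Synthetic data: 100 samples with a 30/70 class split
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 30 + [0] * 70)

# `stratify=y_demo` keeps the 30/70 ratio in both the train and test sets
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, shuffle=True, stratify=y_demo, random_state=42
)
print(y_tr.mean(), y_te.mean())  # both close to 0.3
```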

Since our aim in this notebook is to illustrate explainability rather than model selection, we arbitrarily train a gradient-boosted tree using LightGBM.

In [4]:
import lightgbm as lgb
import pandas as pd

model = lgb.LGBMClassifier(random_state=42).fit(X_train, y_train)

y_pred = model.predict_proba(X_test)[:, 1]
# We use a named pandas series to make plot labels more explicit
y_pred = pd.Series(y_pred, name='has_heart_disease')
y_pred.head()
Out[4]:
0    0.006718
1    0.050039
2    0.883217
3    0.072895
4    0.921450
Name: has_heart_disease, dtype: float64
In [5]:
from sklearn import metrics

# As `y_test` is binary (0 or 1), we need to make `y_pred` binary as well
# for `metrics.accuracy_score` to work.
print(f'Accuracy score: {metrics.accuracy_score(y_test, y_pred > 0.5):.4f}')
Accuracy score: 0.8684
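Since `y_pred` holds probabilities, a threshold-free metric such as ROC AUC usefully complements the thresholded accuracy above. A sketch with toy values standing in for `y_test` and `y_pred`:

```python
from sklearn import metrics

# Toy labels and predicted probabilities (stand-ins for y_test / y_pred)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# ROC AUC ranks the raw probabilities, so no 0.5 threshold is needed
print(f'ROC AUC: {metrics.roc_auc_score(y_true, y_score):.2f}')
```

On the real data this would be `metrics.roc_auc_score(y_test, y_pred)`.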

Let's plot the ten most impactful features on the predictions:

In [6]:
explainer = ethik.ClassificationExplainer()
explainer.plot_influence_ranking(
    X_test=X_test,
    y_pred=y_pred,
    n_features=10,
)
/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/ethik-0.0.4-py3.7.egg/ethik/query.py:144: ConstantWarning: all the values of feature restecg = 2 are identical
/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/ethik-0.0.4-py3.7.egg/ethik/query.py:144: ConstantWarning: all the values of feature ca = 4 are identical
/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/ethik-0.0.4-py3.7.egg/ethik/query.py:144: ConstantWarning: all the values of feature thal = 0 are identical
100%|██████████| 1150/1150 [00:01<00:00, 671.24it/s]

The maximum heart rate achieved (thalach) is the most impactful feature on the predicted probability of heart disease. Let's have a look at the details:

In [7]:
explainer.plot_influence(
    X_test=X_test["thalach"],
    y_pred=y_pred,
)
100%|██████████| 41/41 [00:00<00:00, 676.55it/s]
In [8]:
explainer.plot_influence(
    X_test=X_test["oldpeak"],
    y_pred=y_pred,
)
100%|██████████| 41/41 [00:00<00:00, 667.49it/s]
In [9]:
explainer.plot_influence(
    X_test=X_test["thal"],
    y_pred=y_pred,
)
100%|██████████| 144/144 [00:00<00:00, 669.64it/s]
In [10]:
explainer.plot_influence(
    X_test=X_test["cp"],
    y_pred=y_pred,
)
100%|██████████| 164/164 [00:00<00:00, 670.50it/s]
In [11]:
explainer.plot_influence(
    X_test=X_test[["thalach", "oldpeak", "thal"]],
    y_pred=y_pred,
)
100%|██████████| 226/226 [00:00<00:00, 673.84it/s]