Boston house prices

In this notebook we'll be looking at the Boston house prices dataset.

In [1]:

import pandas as pd
from sklearn import datasets

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
boston = datasets.load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target, name='House price')

X.head()

Out[1]:

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33

As usual in machine learning, we want to diagnose our model on data it has not seen during training. We will thus start by splitting our dataset into a training set and a test set.

In [2]:

from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)
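
The call above does not pass `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. The sketch below illustrates this on made-up stand-in data (the names `X_demo` and `y_demo` are hypothetical, and the values are random); it only assumes the same row count as the Boston dataset, 506.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Made-up stand-in data with the same number of rows as the Boston dataset (506).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({'feature': rng.normal(size=506)})
y_demo = pd.Series(rng.normal(size=506), name='target')

X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, shuffle=True, random_state=42
)

# With no test_size given, scikit-learn holds out 25% of the rows.
split_sizes = (len(X_tr), len(X_te))
print(split_sizes)

# The shuffled rows keep their original labels, so features and targets stay aligned.
indices_match = X_te.index.equals(y_te.index)
print(indices_match)
```

Fixing `random_state` makes the split reproducible from one run to the next.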

We'll train a random forest regressor. Like other ensemble methods, random forests are difficult to interpret on their own. However, because ethik is a black-box method, it doesn't care about the complexity of the model.

In [3]:

from sklearn import ensemble

model = ensemble.RandomForestRegressor(n_estimators=100, max_depth=5)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Reuse the test index so that each prediction stays attached to its row.
y_pred = pd.Series(y_pred, index=y_test.index, name=y_test.name)
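
One detail worth watching when wrapping predictions in a `Series`: `model.predict` returns a bare array, so a `Series` built from it gets a fresh 0..n-1 index unless `index=` is passed, whereas `y_test` keeps the shuffled row labels from the split. Whether a given library combines the two by label or by position is an implementation detail, so matching indices removes the doubt. The sketch below uses made-up values (the names `y_true` and `raw_pred` are hypothetical) to show what label-based alignment does in each case.

```python
import numpy as np
import pandas as pd

# Made-up targets whose labels mimic a shuffled test split.
y_true = pd.Series([24.0, 21.6, 34.7], index=[7, 2, 5], name='House price')
raw_pred = np.array([23.0, 22.0, 33.0])  # stands in for the bare array from model.predict

# Without index=, the Series is labelled 0, 1, 2 and no longer lines up with y_true.
unaligned = pd.Series(raw_pred, name=y_true.name)
# Reusing the test index keeps each prediction attached to its row.
aligned = pd.Series(raw_pred, index=y_true.index, name=y_true.name)

# pandas subtracts by label: in the unaligned case only label 2 is shared,
# and it pairs the second target with the third prediction.
unaligned_errors = y_true - unaligned
aligned_errors = y_true - aligned
print(unaligned_errors)
print(aligned_errors)
```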

The dataset contains a feature named "CRIM", which is the "per capita crime rate by town". It doesn't take a lot of imagination to guess that house prices are negatively correlated with crime rates.

In [4]:

import plotly.graph_objs as go

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=X['CRIM'],
    y=y,
    mode='markers'
))

fig.update_layout(
    xaxis=dict(title='Crime rates'),
    yaxis=dict(title='House prices'),
    title='House prices vs. per capita crime rate'
)
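
A scatter plot gives a visual impression; a correlation coefficient puts a number on it. The following is a minimal sketch, assuming a hypothetical helper named `correlation_summary` and made-up data in which prices fall as the crime rate rises. It is not part of ethik.

```python
import numpy as np
import pandas as pd


def correlation_summary(feature: pd.Series, target: pd.Series) -> float:
    """Return the Pearson correlation between two index-aligned series."""
    return float(feature.corr(target))


# Made-up data: prices decrease with the (hypothetical) crime rate, plus noise.
rng = np.random.default_rng(0)
crime = pd.Series(rng.exponential(scale=3.0, size=200), name='CRIM')
price = pd.Series(30.0 - 1.5 * crime + rng.normal(scale=2.0, size=200), name='House price')

r = correlation_summary(crime, price)
print(f'Pearson r = {r:.2f}')
```

In this notebook the equivalent call would be `correlation_summary(X['CRIM'], y)`. Pearson correlation only measures linear association, so it complements the plot rather than replacing it.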

The previous scatter plot shows that there does indeed seem to be a negative correlation between house prices and crime rates. This is an insight that is based on the data, and is not necessarily what the model will pick up (although in this case, it will). With ethik you can use the plot_influence method to show what the average house price is predicted to be with respect to different crime rate values.

In [5]:

import ethik

explainer = ethik.RegressionExplainer(alpha=0.05)
explainer.plot_influence(
    X_test=X_test['CRIM'],
    y_pred=y_pred
)

100%|██████████| 41/41 [00:00<00:00, 593.65it/s]
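
As a rough cross-check of the trend, rather than a reproduction of what ethik computes, one can average the predictions within quantile bins of the feature. The helper `binned_mean_prediction` below is hypothetical, and the data is made up to stand in for `X_test['CRIM']` and `y_pred`.

```python
import numpy as np
import pandas as pd


def binned_mean_prediction(feature: pd.Series, predictions: pd.Series, n_bins: int = 4) -> pd.Series:
    """Average the predictions within quantile bins of the feature.

    Both series must share the same index.
    """
    bins = pd.qcut(feature, q=n_bins, duplicates='drop')
    return predictions.groupby(bins, observed=True).mean()


# Made-up stand-ins for a feature column and the model's predictions.
rng = np.random.default_rng(0)
crime = pd.Series(rng.exponential(scale=3.0, size=400), name='CRIM')
pred = pd.Series(30.0 - 1.5 * crime + rng.normal(scale=1.0, size=400), name='House price')

summary = binned_mean_prediction(crime, pred)
print(summary)
```

If the model has picked up the negative relationship, the binned means should decrease from the lowest crime-rate bin to the highest.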

In [6]:

explainer.plot_influence_ranking(
    X_test=X_test['CRIM'],
    y_pred=y_pred
)

100%|██████████| 41/41 [00:00<00:00, 599.45it/s]
