Ethik AI

Boston house prices

In this notebook we'll be looking at the Boston house prices dataset.

In [1]:
import pandas as pd
from sklearn import datasets

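# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
boston = datasets.load_boston()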
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target, name='House price')

X.head()
Out[1]:
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33

As usual in machine learning, we want to diagnose our model on a test set. We will thus start by splitting our dataset in two.

In [2]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)
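
As a quick sanity check on what the split produces, the sketch below runs the same call on a synthetic stand-in. The frame `X_demo`, the series `y_demo` and their column names are made up for illustration; only the row count (506) mirrors the Boston dataset.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Synthetic stand-in with the same number of rows as the Boston dataset (506).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(506, 3)), columns=['a', 'b', 'c'])
y_demo = pd.Series(rng.normal(size=506), name='target')

X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, shuffle=True, random_state=42
)

# With no test_size given, scikit-learn holds out about 25% of the rows.
print(len(X_tr), len(X_te))

# The split keeps the original row labels, so features and targets stay aligned.
print(X_te.index.equals(y_te.index))
```

Because the rows are shuffled, the test set keeps its original (non-contiguous) index labels, which matters when predictions are later wrapped in a pandas Series.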

We'll train a random forest regressor. Like other ensemble methods, random forests are difficult to diagnose by themselves. However, because ethik is a black-box method, it doesn't care about the complexity of the model.

In [3]:
from sklearn import ensemble

model = ensemble.RandomForestRegressor(n_estimators=100, max_depth=5)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_pred = pd.Series(y_pred, name=y_test.name)
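
Before explaining a model it can help to check how well it fits the held-out data. The following is a minimal sketch on synthetic data from `make_regression`, not the Boston dataset; the names `X_demo`, `y_demo`, `forest` and `pred` are placeholders introduced here, and the choice of metrics (MAE and R²) is an assumption rather than something the notebook prescribes.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, ensemble, metrics, model_selection

# Synthetic regression data standing in for the notebook's X and y.
X_arr, y_arr = datasets.make_regression(
    n_samples=500, n_features=5, noise=10.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])
y_demo = pd.Series(y_arr, name='target')

X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, random_state=42
)

forest = ensemble.RandomForestRegressor(
    n_estimators=100, max_depth=5, random_state=0
)
forest.fit(X_tr, y_tr)

# Reuse the test index so predictions line up row by row with y_te.
pred = pd.Series(forest.predict(X_te), index=y_te.index, name=y_te.name)

mae = metrics.mean_absolute_error(y_te, pred)
r2 = metrics.r2_score(y_te, pred)
print(f'MAE: {mae:.2f}, R^2: {r2:.2f}')
```

Passing `index=y_te.index` is optional for the metrics themselves, but it keeps the predictions aligned with the test rows in any later pandas operation that matches on index labels.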

The dataset contains a feature named "CRIM", which is the "per capita crime rate by town". It doesn't take a lot of imagination to guess that house prices are negatively correlated with crime rates.

In [4]:
import plotly.graph_objs as go

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=X['CRIM'],
    y=y,
    mode='markers'
))

fig.update_layout(
    xaxis=dict(title='Crime rates'),
    yaxis=dict(title='House prices'),
    title='House prices vs. per capita crime rate'
)
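
To put a number on a relationship like the one in the scatter plot, one option is to compute correlation coefficients. The sketch below uses synthetic data (a skewed feature called `crime` and a decreasing target called `price`, both invented for illustration), so the printed values say nothing about the Boston dataset itself.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: a skewed, non-negative feature and a target that
# decreases with it, plus noise. These are NOT the Boston values.
rng = np.random.default_rng(0)
crime = pd.Series(rng.lognormal(mean=0.0, sigma=1.5, size=300), name='crime')
price = pd.Series(
    30 - 4 * np.log1p(crime) + rng.normal(scale=3, size=300), name='price'
)

# Pearson measures linear association on the raw values.
pearson = crime.corr(price)

# Spearman is the Pearson correlation of the ranks, which makes it less
# sensitive to the heavy right tail of the feature.
spearman = crime.rank().corr(price.rank())

print(f'Pearson: {pearson:.2f}, Spearman: {spearman:.2f}')
```

On the notebook's own data the same two lines would apply to `X['CRIM']` and `y`.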

The previous scatter plot does suggest a negative correlation between house prices and crime rates. This insight comes from the data, and is not necessarily what the model will pick up (although in this case, it will). With ethik you can use the plot_influence method to show how the average predicted house price varies with the crime rate.

In [5]:
import ethik

explainer = ethik.RegressionExplainer(alpha=0.05)
explainer.plot_influence(
    X_test=X_test['CRIM'],
    y_pred=y_pred
)
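
As a rough cross-check that does not rely on ethik, one can average the model's predictions within quantile bins of a feature. This is a cruder, swapped-in technique and not a description of how ethik computes its curves. The helper `binned_mean_prediction` and the synthetic `feature` and `preds` series are hypothetical names used only for this sketch.

```python
import numpy as np
import pandas as pd


def binned_mean_prediction(feature, y_pred, n_bins=5):
    """Average prediction within quantile bins of a single feature."""
    # qcut builds bins holding roughly equal numbers of rows.
    bins = pd.qcut(feature, q=n_bins, duplicates='drop')
    return y_pred.groupby(bins, observed=True).mean()


# Synthetic stand-in for one test-set feature and the model's predictions.
rng = np.random.default_rng(1)
feature = pd.Series(rng.lognormal(sigma=1.5, size=400), name='feature')
preds = pd.Series(
    30 - 4 * np.log1p(feature) + rng.normal(scale=1, size=400),
    name='prediction',
)

summary = binned_mean_prediction(feature, preds)
print(summary)
```

With the notebook's variables, the call would be `binned_mean_prediction(X_test['CRIM'], y_pred)`, provided both series share the same index.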
In [6]:
explainer.plot_influence_ranking(
    X_test=X_test['CRIM'],
    y_pred=y_pred
)