Boston house prices
In this notebook we'll be looking at the Boston house prices dataset.
import pandas as pd
from sklearn import datasets
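# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
boston = datasets.load_boston()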
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target, name='House price')
X.head()
As usual in machine learning, we want to diagnose our model on a test set. We will thus start by splitting our dataset in two.
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)
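As a quick sanity check on the split, the sketch below uses a small synthetic frame (the column names and values are made up for illustration and are not taken from the Boston data). It shows that `train_test_split` holds out a quarter of the rows when neither `test_size` nor `train_size` is given, and that features and target stay aligned.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Synthetic stand-in data, purely illustrative
rng = np.random.RandomState(42)
X_demo = pd.DataFrame(rng.rand(100, 3), columns=['a', 'b', 'c'])
y_demo = pd.Series(rng.rand(100), name='target')

X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, shuffle=True, random_state=42
)

# With the default settings, a quarter of the rows go to the test set
print(len(X_tr), len(X_te))

# Features and target are shuffled together, so their indices still match
print(X_te.index.equals(y_te.index))
```

Running the same two checks on `X_train`, `X_test`, `y_train` and `y_test` from the cell above is a cheap way to confirm the split before training.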
We'll train a random forest regressor. Like other ensemble methods, random forests are difficult to diagnose by themselves. However, because ethik is a black-box method, it doesn't care about the complexity of the model.
from sklearn import ensemble
model = ensemble.RandomForestRegressor(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred = pd.Series(y_pred, name=y_test.name)
The dataset contains a feature named "CRIM", which is the "per capita crime rate by town". It doesn't take a lot of imagination to guess that house prices are negatively correlated with crime rates.
import plotly.graph_objs as go
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=X['CRIM'],
    y=y,
    mode='markers'
))
fig.update_layout(
    xaxis=dict(title='Crime rates'),
    yaxis=dict(title='House prices'),
    title='House prices vs. per capita crime rate'
)
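The relationship between crime rate and price can also be checked numerically. The sketch below is one way to do it, using a Spearman rank correlation, which is a reasonable choice when a feature is heavily skewed and the relationship may not be linear. The `rank_correlation` helper and the synthetic data are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def rank_correlation(feature: pd.Series, target: pd.Series) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return feature.rank().corr(target.rank())

# Synthetic stand-in data (assumption): a skewed feature and a target
# that decreases with it, plus some noise
rng = np.random.RandomState(0)
crime = pd.Series(rng.exponential(scale=3.0, size=200), name='CRIM')
price = pd.Series(
    30 - 2 * np.log1p(crime) + rng.normal(scale=1.0, size=200),
    name='House price'
)

rho = rank_correlation(crime, price)
print(f'Spearman correlation: {rho:.2f}')
```

In the notebook itself, the equivalent call would be `rank_correlation(X['CRIM'], y)`; a clearly negative value would support what the scatter plot suggests.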
The previous scatter plot shows that there does indeed seem to be a negative correlation between house prices and crime rates. This insight is based on the data, and is not necessarily what the model will pick up (although in this case, it will). With ethik you can use the plot_influence method to show how the average predicted house price varies with the crime rate.
import ethik
explainer = ethik.RegressionExplainer(alpha=0.05)
explainer.plot_influence(
    X_test=X_test['CRIM'],
    y_pred=y_pred
)
explainer.plot_influence_ranking(
    X_test=X_test['CRIM'],
    y_pred=y_pred
)
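To build some intuition for what "the average predicted price with respect to different crime rate values" can mean, here is a minimal sketch of one possible approach: reweight the test samples so that the weighted mean of the feature hits a chosen value, then take the weighted average of the predictions. It assumes an exponential reweighting of the samples and is an illustration only, not ethik's actual implementation. The `tilted_weights` helper and the synthetic data are hypothetical.

```python
import numpy as np

def tilted_weights(x, target_mean, lam_bounds=(-50.0, 50.0), n_iter=100):
    """Weights proportional to exp(lam * z) whose weighted mean of x is target_mean.

    Assumes x is not constant and target_mean lies well inside the range of x.
    """
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()  # standardise for numerical stability

    def weights(lam):
        logits = lam * z
        w = np.exp(logits - logits.max())  # subtract the max to avoid overflow
        return w / w.sum()

    lo, hi = lam_bounds
    # The weighted mean of x increases with lam, so bisection finds the tilt
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if weights(mid) @ x < target_mean:
            lo = mid
        else:
            hi = mid
    return weights(0.5 * (lo + hi))

# Synthetic stand-in data (assumption): predictions that decrease with the feature
rng = np.random.RandomState(0)
crime = rng.exponential(scale=3.0, size=500)
pred = 30 - 2 * np.log1p(crime) + rng.normal(scale=0.5, size=500)

results = []
for q in (0.25, 0.5, 0.75):
    m = np.quantile(crime, q)
    w = tilted_weights(crime, m)
    results.append((m, w @ pred))
    print(f'mean CRIM {m:.2f} -> average prediction {w @ pred:.2f}')
```

Each printed row is one point on a curve of average prediction against the shifted feature mean. With a model that has learned a negative relationship, the average prediction should fall as the target mean rises.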