Adult income classification¶
In the following notebook we'll be working with the "Adult" dataset. This dataset contains a binary label indicating whether a person's annual income is larger than $50k. The data is available on the UCI machine learning repository.
This notebook explains how to use the package but not how it works under the hood. To learn more about that, please read the "How It Works" page.
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
names = [
'age', 'workclass', 'fnlwgt', 'education',
'education-num', 'marital-status', 'occupation',
'relationship', 'race', 'gender', 'capital-gain',
'capital-loss', 'hours-per-week', 'native-country',
'salary'
]
dtypes = {
'workclass': 'category',
'education': 'category',
'marital-status': 'category',
'occupation': 'category',
'relationship': 'category',
'race': 'category',
'gender': 'category',
'native-country': 'category'
}
X = pd.read_csv(url, names=names, header=None, dtype=dtypes)
X['gender'] = X['gender'].str.strip().astype('category')  # Strip surrounding whitespace
y = X.pop('salary').map({' <=50K': False, ' >50K': True})
X.head()
ethik analyzes a model based on the predictions it makes on a test set. Consequently, we first have to split our dataset in two.
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)
We will now train a classifier using LightGBM.
import lightgbm as lgb
model = lgb.LGBMClassifier(random_state=42).fit(X_train, y_train)
We can now make predictions for the test set. We'll use a variable named y_pred to store the predicted probabilities associated with the True label.
y_pred = model.predict_proba(X_test)[:, 1]
# We use a named pandas series to make plot labels more explicit
y_pred = pd.Series(y_pred, name='>$50k')
We can now fit an Explainer using the features from the test set. This will analyze the distribution of each feature and build a set of lambda coefficients which can be used to explain model predictions.
import ethik
explainer = ethik.ClassificationExplainer()
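To build an intuition for what such a coefficient does, here is a minimal, hand-rolled sketch. The exponential reweighting below is our assumption for illustration, so refer to the "How It Works" page for the actual method: the idea is to pick a coefficient lambda so that reweighting the test samples shifts a feature's mean to a target value; the same weights then yield the model's mean prediction under that shift.

import numpy as np
from scipy import optimize

# Hand-rolled illustration (an assumption, not ethik's actual code):
# find lambda such that weights proportional to exp(lambda * x) shift
# the mean age of the test set to a target value.
x = X_test['age'].to_numpy(dtype=float)

def mean_gap(lam, target):
    weights = np.exp(lam * (x - x.mean()))
    return np.average(x, weights=weights) - target

lam = optimize.brentq(lambda lam: mean_gap(lam, 30), -1, 1)

# The same weights give the model's average prediction "as if"
# the mean age of the test set were 30.
weights = np.exp(lam * (x - x.mean()))
print(np.average(y_pred, weights=weights))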
Understanding model predictions¶
We can start by looking at how the probability of having a higher income changes with respect to the education-num variable, as perceived by the model.
explainer.plot_influence(
X_test=X_test['education-num'],
y_pred=y_pred
)
Clearly, the model believes that the probability of having a salary above $50k increases with the amount of education. Although this might seem like an obvious statement, it's good to confirm that the model is seeing it. Moreover, it's helpful to be able to quantify by how much the model changes its predictions.
To plot multiple charts in the same cell, we need to call the .show() method:
explainer.plot_influence(
X_test=X_test['age'],
y_pred=y_pred
).show()
explainer.plot_influence(
X_test=X_test['education-num'],
y_pred=y_pred
).show()
We can also plot the distribution of predictions for more than one variable. However, because different variables have different scales, we have to use a common measure to display them together. For this purpose we plot the τ ("tau") values. These values range between -1 and 1 and simply reflect by how much the variable is shifted from its mean towards its lower and upper quantiles. In the following figure, a tau value of -1 corresponds to just under 20 years old, whereas a tau value of 1 refers to being slightly over 60 years old.
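As a rough illustration, here is how a tau value might map back to a raw age, assuming tau = -1 and tau = 1 correspond to low and high quantiles of the feature (the 5% and 95% quantiles below are our assumption; check the API reference for the exact values ethik uses):

# Illustrative mapping from tau to a raw age value; the 5% and 95%
# quantiles are an assumption, not necessarily ethik's exact choice.
age = X_test['age']
mean, q_low, q_high = age.mean(), age.quantile(0.05), age.quantile(0.95)

def tau_to_age(tau):
    if tau < 0:
        return mean + tau * (mean - q_low)  # tau = -1 maps to q_low
    return mean + tau * (q_high - mean)  # tau = 1 maps to q_high

print([round(tau_to_age(t), 1) for t in (-1, 0, 1)])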
explainer.plot_influence(
X_test=X_test[['age', 'hours-per-week', 'education-num']],
y_pred=y_pred,
colors={
'age': 'red',
'hours-per-week': 'green',
'education-num': 'blue'
}
)
Try clicking on a line to update the top x-axis.
One of the uses of these kinds of plots is to see whether variables affect the outcome on average or not. Indeed, the straighter the line, the smaller the impact the associated variable has on the average outcome. This is very handy to know if said variable is, say, a social trait such as ethnicity and the target is a credit score. In this case, ethik can be used to visualize and quantify the bias of the model with respect to the social trait.
We can also get an overview of features' importance and determine which ones impact the predictions the most:
explainer.plot_influence_ranking(
X_test=X_test[['age', 'education-num', 'hours-per-week', 'gender']],
y_pred=y_pred,
)
The importance is computed as the average absolute change in bias per tau increase. If the curves plotted above are horizontal lines, we can conclude that the corresponding features do not impact the predictions at all. So, to compute the importance of a feature, we measure how far its curve is from such a horizontal line. In other words, the less flat the curve, the more important the feature is to the model.
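Conceptually, this boils down to measuring the average slope of each curve. A toy sketch with made-up numbers (not ethik's actual code):

import numpy as np

# Hypothetical bias curve: mean prediction at each tau value
# (illustrative numbers, not real output).
taus = np.linspace(-1, 1, 5)
curve = np.array([0.10, 0.15, 0.24, 0.35, 0.48])

# Average absolute change in bias per tau increase; 0 for a perfectly
# flat curve, larger when the feature shifts the predictions more.
importance = np.mean(np.abs(np.diff(curve) / np.diff(taus)))
print(importance)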
When there are a lot of features, we can use the n_features parameter:
explainer.plot_influence_ranking(
X_test=X_test[['age', 'hours-per-week', 'education-num', 'gender']],
y_pred=y_pred,
n_features=3,
)
Evaluating model reliability¶
ethik can also be used to assess the reliability of a model with respect to a variable. Let us first evaluate the global performance of our model.
from sklearn import metrics
# As `y_test` is binary (0 or 1), we need to make `y_pred` binary as well
# for `metrics.accuracy_score` to work.
print(f'Accuracy score: {metrics.accuracy_score(y_test, y_pred > 0.5):.4f}')
With ethik, we can see how well the model performs with respect to a variable, for example age.
explainer.plot_performance(
X_test=X_test['age'],
y_test=y_test,
y_pred=y_pred > 0.5,
metric=metrics.accuracy_score,
)
We can see that although the overall accuracy is around 0.88, it is much higher when the age is lower. This is quite intuitive, as we can imagine that young adults more often than not have a salary under $50k. When they get older, many things can happen and their salary isn't as easy to guess, which translates to a lower model accuracy.
In the same way as before, we can visualize the model's performance with respect to multiple variables.
explainer.plot_performance(
X_test=X_test[['age', 'education-num']],
y_test=y_test,
y_pred=y_pred > 0.5,
metric=metrics.accuracy_score,
)
These kinds of reliability plots can be used in many cases. For example, now that we know the model is less accurate for older people than for young ones, we might want to focus our data analysis on older people in order to extract helpful features. ethik can thus help guide a data science project by telling you where your model is failing to perform.
We can also rank features by their impact on performance. Here, we want to show how badly the model can perform when we shift each feature's mean:
explainer.plot_performance_ranking(
X_test=X_test[['age', 'education-num', 'hours-per-week', 'gender']],
y_test=y_test,
y_pred=y_pred > 0.5,
metric=metrics.accuracy_score,
criterion='min', # We are looking at the worst accuracy
)
This plot tells us that whatever the mean age of the dataset is (among the ones we computed), the model's accuracy is at least 86%. We also notice that changing the mean of education-num can lead to worse performance than changing the mean age. In other words, the above bar plot displays potential accuracy scores in worst-case scenarios.
To plot the n features for which the model reaches the lowest accuracy, we can use the n_features parameter:
explainer.plot_performance_ranking(
X_test=X_test[['age', 'education-num', 'hours-per-week', 'gender']],
y_test=y_test,
y_pred=y_pred > 0.5,
metric=metrics.accuracy_score,
criterion='min', # We are looking at the worst accuracy
n_features=-2, # We plot the two features with the smallest score in the ranking
)
Categorical features¶
Until now, we have only manipulated numeric features. But ethik can also compute the influence of categorical features, which must have either the object or category dtype in the dataframe:
X_test['gender'].head()
explainer.plot_influence(
X_test=X_test['gender'],
y_pred=y_pred,
)
From the categorical feature gender, two numeric features are created: one per category, each representing the proportion of that category in the dataset (between 0 and 1). Since we only have two possible values for gender, the resulting numeric features are symmetric.
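We can mimic this encoding with pandas to see what the underlying numeric features look like (a sketch of the idea, not ethik's internal code):

# One binary column per category; each column's mean is the proportion
# of that category in the test set, which is what the tau axis shifts.
one_hot = pd.get_dummies(X_test['gender'])
print(one_hot.mean())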
Not surprisingly, but sadly, we can see that according to the model, men have a higher chance of earning more than $50k a year. The model is simply reproducing the bias contained in the dataset. Note, however, that this is a correlation and that causality has not been established at this stage.
We can plot the performance as well:
explainer.plot_performance(
X_test=X_test['gender'],
y_test=y_test,
y_pred=y_pred > 0.5,
metric=metrics.accuracy_score,
)
To plot a single category, we use ethik.extract_category():
explainer.plot_influence(
X_test=ethik.extract_category(X_test['gender'], 'Male'),
y_pred=y_pred,
)
Robustness¶
Right. We have observed both the model's bias and its performance. How can we be sure that these estimates are reliable? One criterion for trusting an algorithm is its robustness, i.e. the fact that it gives similar outputs for similar inputs.
To check the robustness in ethik, we can compute a confidence interval on the explanation:
- Sample p% of the rows in the dataset
- Compute the explanation (i.e. the bias or the performance)
- Repeat n times
- Look at the distribution of the explanations
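Here is a hand-rolled sketch of that procedure, using the mean prediction as a stand-in for the actual explanation that ethik recomputes internally:

import numpy as np

# Resample 80% of the predictions 30 times and look at the spread of a
# stand-in statistic (ethik recomputes the actual explanation instead).
n, p = 30, 0.8
estimates = [y_pred.sample(frac=p).mean() for _ in range(n)]
print(np.percentile(estimates, [5, 95]))  # 5% and 95% quantiles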
Let's do it for n = 30 and p = 0.8 (the default):
# Please check the API reference for further details
explainer = ethik.ClassificationExplainer(n_samples=30)
Then we can explain and plot the bias just as before. To compute the confidence interval, we consider the 5% and 95% quantiles (this can be configured; see the API reference):
explainer.plot_influence(
X_test=X_test['education-num'],
y_pred=y_pred,
)
The plotted line is the mean. The confidence interval is so small that we need to zoom in to see it. This means that the algorithm is quite robust on this dataset (at least for the education-num feature).
The algorithm is a little less robust when computing the performance:
explainer.plot_performance(
X_test=X_test['education-num'],
y_test=y_test,
y_pred=y_pred > 0.5,
metric=metrics.accuracy_score,
)
Comparing individuals¶
Let's consider two individuals of the dataset:
bob = X_test.iloc[2].rename("bob")
bob
mary = X_test.iloc[1].rename("mary")
mary
To visualize how the model behaves for Bob and Mary, we could plot the curves like we did above and then look at each individual's value on the x-axis for a given feature. But this makes it hard to compare the output across all features (when they are plotted together, the x-axis is $\tau$, so we can't easily determine where Bob and Mary land).
Instead, we can call dedicated methods:
explainer.plot_influence_comparison(
X_test=X_test[["age", "education-num", "hours-per-week", "gender"]],
y_pred=y_pred,
reference=bob,
compared=mary,
)
Here, we can see that, on average, people of Mary's age are 13% more likely to earn more than $50k a year than people of Bob's age (which is expected because Mary is older than Bob and, broadly speaking, older people earn more).
Unfortunately, we also see that people of Mary's gender (women) are about 19% less likely to earn more than $50k per year than people of Bob's gender (men).
We can do the same for the performance:
explainer.plot_performance_comparison(
X_test=X_test[["age", "education-num", "hours-per-week"]],
y_test=y_test,
y_pred=y_pred > 0.5,
metric=metrics.accuracy_score,
reference=bob,
compared=mary,
)