How It Works¶
In this notebook, we explain the maths behind ethik and how it is implemented. Reading it is not required to start using the package; for that, please read the Getting started tutorial.
Let's assume you have a dataset and a model that generates predictions on this dataset. For the sake of the example, we'll be working with the Adult dataset.
import ethik
import numpy as np
import pandas as pd
import plotly.graph_objs as go
from sklearn import metrics, model_selection
X, y = ethik.datasets.load_adult()
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)
X_test.head()
ethik is model-agnostic and works with the data only.
def get_black_box_model_predictions():
import lightgbm as lgb
model = lgb.LGBMClassifier(random_state=42).fit(X_train, y_train)
return model.predict_proba(X_test)[:, 1]
y_pred = get_black_box_model_predictions()
# We use a named pandas series to make plot labels more explicit
y_pred = pd.Series(y_pred, name=">$50k")
y_pred
ethik lets us explore how each feature impacts the model's behavior, either its predictions or its performance.
The Maths¶
Mathematically, we have a dataset of $n$ samples and $p$ features:
$$ \begin{aligned} X &= \{(x_i^1, x_i^2, \ldots, x_i^p) \}_{1 \leq i \leq n} \\ &= \{x_i\}_{1 \leq i \leq n} \end{aligned} $$
For data scientists, $X$ is X_test.
We also have the model's predictions:
$$ \begin{aligned} \hat{Y} &= \{f(x_i)\}_{1 \leq i \leq n} \\ &= \{\hat{y}_i\}_{1 \leq i \leq n} \end{aligned} $$
We potentially also have $Y$, the true output observed in the real world.
The dataset $(X, \hat{Y}, Y)$ can be considered as an empirical probability distribution $Q_n$ (the unobserved true distribution being $Q$). As explained in the paper, the key idea of ethik is to stress this distribution with respect to a property of $X$ in order to reach a target $t$ on this property.
For now, we only handle targets on the mean, i.e. we can stress $Q_n$ so that the mean of $X$ is $\mu_t = (\mu_t^1, \mu_t^2, \ldots, \mu_t^p)$. Then we obtain a new distribution $Q_t$.
But of course, many distributions have the mean $\mu_t$, so the target alone is not sufficient to determine the stress. Because the model was trained on $Q_n$ (we assume that the training set and the test set have the same distribution), we want $Q_t$ to be as close as possible to $Q_n$. We do this by minimizing the Kullback-Leibler divergence between $Q_n$ and $Q_t$.
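In other words, restating the previous paragraph with the notations introduced above, the stressed distribution is the solution of a constrained minimization problem:
$$ Q_t = \underset{Q \,:\, E_Q[X] = \mu_t}{\arg\min} \; KL(Q, Q_n) $$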
Let's take an example. First, let's build a toy dataset with two features $X_{age}$ and $X_{education-num}$:
n = len(X_test)
ds = pd.DataFrame(
[
[X_test["age"].iloc[i], X_test["education-num"].iloc[i], y_pred[i], int(y_test.iloc[i])]
for i in range(n)
],
columns=["age", "education-num", "ŷ", "y"]
)
ds
Let's plot the marginal distribution of $X_{age}$:
def plot_pdf(x, density):
width = x[1:] - x[:-1]
mean = sum(x * density) / sum(density)
return go.Figure().add_bar(
x=x,
y=density,
width=width,
).update_layout(
plot_bgcolor="#FFF",
xaxis=dict(
title="age",
linecolor="black",
),
yaxis=dict(
title="Density",
linecolor="black",
),
shapes=[
go.layout.Shape(
type="line",
x0=mean,
y0=0,
x1=mean,
y1=1,
yref="paper",
line=dict(
color="red",
width=1,
dash="dash",
)
)
],
annotations=[
go.layout.Annotation(
x=mean,
y=1,
xref="x",
yref="paper",
text=f"Mean = {mean:.2f}",
showarrow=False,
yanchor="middle",
bgcolor="#FFF",
bordercolor="red",
borderpad=4,
)
]
)
densities, edges = np.histogram(ds["age"], bins=40, density=True)
x_pdf = edges[:-1]
plot_pdf(x_pdf, densities)
Like with counterfactual explanations, we now ask the question:
What if the mean age were $\mu_t$?
For instance, we can stress $Q_n$ so that the mean age is approximately $\mu_t = 30$:
target = 30
weights = [2 if age <= target else 0.5 for age in ds["age"]]
stressed_densities, _ = np.histogram(ds["age"], bins=40, density=True, weights=weights)
plot_pdf(x_pdf, stressed_densities)
We could sample from this distribution and observe how the model behaves for a mean age of about 30. It would be similar to Partial Dependence Plots.
The problem here is that we only stressed the marginal distribution of $X_{age}$. But features may be correlated, and if we don't take that into account, we are likely to create unrealistic individuals (say, an 18-year-old with a PhD).
To address this issue, the key idea behind the paper is to stress the whole distribution $(X, \hat{Y}, Y)$ while minimizing the distance to the original distribution, measured by the Kullback–Leibler divergence.
Minimizing the Kullback–Leibler divergence between the stressed distribution and the original one gives us some confidence that we are not creating individuals that are too unrealistic. Of course, reaching a minimum doesn't mean that this minimum is actually small, and to certify the model we would need to define a threshold on the distance. That is outside the scope of this notebook, though.
As explained in the paper, we can efficiently compute a unique set of weights $\{\lambda_i^{(t)}\}_{1 \leq i \leq n}$ to stress the probability density function to reach a target on a property while minimizing the KL divergence:
$$ Q_t = \sum_{i=1}^n \lambda_i^{(t)} \delta_{x_i, \hat{y}_i, y_i} $$
with:
- $Q_t$ being the distribution that meets the target $t$ (e.g. $E[X_{age}] = \mu_t$) while minimizing $KL(Q_t, Q_n)$;
- $\delta_{x_i, \hat{y}_i, y_i}$ being the Dirac mass (point mass) at the $i$-th sample $(x_i, \hat{y}_i, y_i)$.
Remember: $Q_t$ includes the input $X$ but also the output $\hat{Y}$ (and potentially the true output $Y$). This means that we can compute what the output would be if $X$ met the target $t$, without having to run the model again. This is a huge performance gain!
Note that we do not change the individuals of the dataset, only their probability density. In other words, to increase the mean of a die, we do not replace the 6 with a 7; we load the die to increase the probability of rolling a 6.
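To make the die analogy concrete, here is a small sketch in plain NumPy/SciPy (this is illustration code, not part of ethik): we look for the exponential tilting of a fair die that moves its mean from 3.5 to 4 by reweighting the faces, never changing their values.
import numpy as np
from scipy import optimize

faces = np.arange(1, 7)  # a fair six-sided die
target_mean = 4.0        # we want the mean to be 4 instead of 3.5

def tilted_mean(xi):
    # Exponentially tilted weights, normalised to sum to 1 (the same form as the
    # lambda_i^(t) weights introduced above, with a uniform base distribution)
    weights = np.exp(xi * faces)
    weights /= weights.sum()
    return (weights * faces).sum()

# Find the xi that reaches the target mean
xi = optimize.brentq(lambda v: tilted_mean(v) - target_mean, -10, 10)
weights = np.exp(xi * faces)
weights /= weights.sum()
weights  # the 6 is now more likely and the 1 less likely, but the faces are unchanged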
The theorem that tells us how to compute these weights works for targets on the mean of $X$. By transforming $X$, we can also reach targets on higher-order moments (e.g. the variance). For instance, the mean of the feature $(X - E[X])^2$ is actually the variance of $X$. In the paper, the transformation of $X$ is called $\Phi(X)$ (e.g. $\Phi : X \mapsto (X - E[X])^2$).
So it enables us to answer questions like: What would the output distribution be if the mean of...
- this feature were $\mu_t$? (we only consider one feature, i.e. its marginal distribution)
- these features were $(\mu_t^1, \mu_t^2)$? (we consider multiple features and so their potential correlations)
- $(X^{j_0} - E[X^{j_0}])^2$ (i.e. the variance) were $\sigma_t^2$? (we consider the variance of a feature instead of its mean)
- ...
And all of this without having to run the model again.
Implementation¶
Since we have a finite number of data samples, we can consider our probability distribution as discrete. Stressing the distribution means applying weights to the density of every individual:
$$ Q_t = \sum_{i=1}^n \lambda_i^{(t)} \delta_{x_i, \hat{y}_i, y_i}. $$
Note that the $x_i$ may have been transformed by $\Phi$, as illustrated above.
As explained in the paper, these weights $\lambda_i^{(t)}$ are derived from a common parameter $\xi_t$:
$$ \begin{aligned} \lambda_i^{(t)} &= \frac{\exp(\langle \xi_t, x_i \rangle)}{\sum_{j=1}^n \exp(\langle \xi_t, x_j \rangle)} \\ &= \text{softmax}(\langle \xi_t, x_i \rangle) \end{aligned} $$
with $\langle \cdot, \cdot \rangle$ being the scalar product. Each $\lambda_i^{(t)}$ is therefore a scalar.
Then the mean of the stressed distribution is given by:
$$ \mu_t = \frac{1}{\sum_{i=1}^n \lambda_i^{(t)}} \sum_{i=1}^n \lambda_i^{(t)} \times x_i $$
To find $\xi_t$ algorithmically, we minimize the function (target_mean - current_mean) ** 2, with current_mean being computed with the previous formulas. The code is in ethik.base_explainer.compute_ksi().
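To make this concrete, here is a minimal single-feature sketch of that minimization in plain NumPy/SciPy. It mimics the idea behind ethik.base_explainer.compute_ksi() but is not the actual implementation (ethik handles several features at once and treats the numerical details differently); the centering and the bounds below are choices made for this example only.
import numpy as np
from scipy import optimize

def compute_weights(ksi, x):
    # lambda_i = softmax(ksi * x_i): exponential tilting of the empirical distribution.
    # Subtracting the mean inside the exponent does not change the weights (the
    # constant factor cancels in the normalisation) but avoids overflows.
    w = np.exp(ksi * (x - x.mean()))
    return w / w.sum()

def stressed_mean(ksi, x):
    # Mean of the feature under the stressed distribution
    return np.sum(compute_weights(ksi, x) * x)

def find_ksi(x, target_mean):
    # Minimize (target_mean - current_mean) ** 2 with respect to ksi;
    # the bounds are ad hoc but wide enough for this example
    loss = lambda ksi: (target_mean - stressed_mean(ksi, x)) ** 2
    return optimize.minimize_scalar(loss, bounds=(-1, 1), method="bounded").x

ages = X_test["age"].to_numpy()
ksi = find_ksi(ages, target_mean=30)
ksi, stressed_mean(ksi, ages)  # the stressed mean should be close to 30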
Explaining a black-box model with another black box wouldn't be very convincing, so let's dive into the machinery.
First, we can visualize the stressed distribution in ethik:
ethik.ClassificationExplainer().plot_distributions(
X_test["age"],
targets=[25, 45]
)
The distribution for a mean age of 25 doesn't look much like the original distribution, which means that some individuals get a lot more weight than in the data. We can visualize how many individuals capture X% of the weight (i.e. X% of the sum of the $\lambda_i^{(t)}$):
ethik.ClassificationExplainer().plot_cumulative_weights(
X_test["age"],
targets=[25, 45]
)
To plot this chart, we:
- Compute the weights $\lambda_i^{(t)}$;
- Sort them in decreasing order (highest first);
- Plot their cumulative sum.
A straight line would mean that the weight is uniformly distributed across the individuals. Here, we can see that for a mean of 25, 50% of the weight is captured by 14% of the individuals. Whether this is too small or not is outside the scope of this notebook.
Note that we only stress one feature here, so the potential correlations with other features are not taken into account. The algorithm just makes sure that the stressed (marginal) distribution is the one closest to the original one (in terms of KL divergence).
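We can reproduce the idea behind this chart by hand with the helper functions from the earlier sketch (again, illustration code, so the exact number may differ slightly from what ethik plots):
ages = X_test["age"].to_numpy()
lambdas = compute_weights(find_ksi(ages, target_mean=25), ages)

# Sort the weights in decreasing order and accumulate them
cumulated = np.cumsum(np.sort(lambdas)[::-1])

# Smallest fraction of individuals that captures at least 50% of the total weight
n_half = np.searchsorted(cumulated, 0.5) + 1
print(f"{n_half / len(ages):.0%} of the individuals capture 50% of the weight")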
Data structures¶
To explain how we manipulate the data, we'll use the internal class BaseExplainer:
base_explainer = ethik.base_explainer.BaseExplainer()
First, we need to build a query, that is, a pandas dataframe that contains at least four columns to answer the question: What would the output label be if this feature were equal to target?
- feature: It must match the name of a column in the dataset.
- target: This value must satisfy feature.min() < target < feature.max().
- label: This value is used for multi-class classification problems. For binary classification and regression, it must just match the name of the pandas series containing the predictions.
- group: A key to determine the targets that must be reached together. For instance, if the rows ("age", 30, "output") and ("education-num", 10, "output") have the same group, it means that we will consider the joint distribution $(X_{age}, X_{education-num})$ and try to reach the mean $(30, 10)$. This differs from considering the marginal distributions $X_{age}$ and $X_{education-num}$ independently.
For instance:
query = pd.DataFrame({
"group": [0, 1],
"feature": ["age", "age"],
"target": [30, 45],
"label": [y_pred.name, y_pred.name],
})
query
To reach multiple targets at the same time, we specify a common group. Of course, a group cannot contain more than one row per feature, and groups don't all need to have the same number of features.
pd.DataFrame({
"group": [0, 0, 1],
"feature": ["age", "education-num", "age"],
"target": [30, 10, 45],
"label": [y_pred.name, y_pred.name, y_pred.name],
})
To compute the weights used for the stress, the method _fill_ksis() is called internally:
base_explainer._fill_ksis(X_test, query)
_fill_ksis() alters its input:
query
In practice, we directly call an _explain_*() method. Let's initialize the query again so that we are in that situation:
query = pd.DataFrame({
"group": [0, 1],
"feature": ["age", "age"],
"target": [30, 45],
"label": [y_pred.name, y_pred.name],
})
query
base_explainer._explain_influence(
X_test=X_test,
y_pred=y_pred,
query=query
)
The original query has not been altered:
query
As seen before, ksi is the parameter used to compute the weights. influence is the mean of the output. We can read the dataframe above this way: when the mean age is 30, the average probability of earning more than $50k a year is about 15%.
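Concretely, the influence is the $\lambda_i^{(t)}$-weighted mean of the predictions. We can check this by hand with the helpers from the earlier sketch (the value may differ marginally from ethik's, which uses its own optimisation routine):
ages = X_test["age"].to_numpy()
lambdas = compute_weights(find_ksi(ages, target_mean=30), ages)

# Average predicted probability of earning more than $50k when the mean age is stressed to 30
np.average(y_pred, weights=lambdas)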
influence_low and influence_high are for the confidence interval. As we didn't specify n_samples when we created the explainer, their value is equal to the mean. Otherwise, we follow this procedure (sketched in code after the list):
- Sub-sample the dataset (e.g. take 80% of the samples)
- Add the smallest and the highest values of the dataset (so that we can reach extreme targets)
- Compute the feature influence
- Do that n times
- Get the mean of the results and the 5% and 95% quantiles (defined by explainer.conf_level) for the confidence interval
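Here is a rough sketch of that procedure for the target of 30, still using the helpers from the earlier sketch (the actual logic lives in ethik and may differ in its details, e.g. the sub-sampling fraction):
rng = np.random.RandomState(42)
ages = X_test["age"].to_numpy()
preds = y_pred.to_numpy()

influences = []
for _ in range(10):  # n_samples = 10
    # Sub-sample 80% of the individuals...
    idx = rng.choice(len(ages), size=int(0.8 * len(ages)), replace=False)
    # ...and always keep the extreme individuals so that extreme targets stay reachable
    idx = np.union1d(idx, [ages.argmin(), ages.argmax()])
    lambdas = compute_weights(find_ksi(ages[idx], target_mean=30), ages[idx])
    influences.append(np.average(preds[idx], weights=lambdas))

# Mean influence and the 5%-95% confidence interval
np.mean(influences), np.quantile(influences, [0.05, 0.95])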
ethik.base_explainer.BaseExplainer(n_samples=10)._explain_influence(
X_test=X_test,
y_pred=y_pred,
query=query
)
We can also ask the question: What is the average performance when the mean age is 30 or 45?
base_explainer._explain_performance(
X_test=X_test,
y_test=y_test,
y_pred=y_pred > 0.5,
metric=metrics.accuracy_score,
query=query
)
Query building¶
In the public API, we expose classes that inherit from CacheExplainer (which itself inherits from BaseExplainer). This class does three things:
- It builds the query automatically using ethik.query.Query();
- It calls the methods of BaseExplainer to explain the model;
- It stores the results in an attribute (we'll see that in detail later).
To build the query, it uses the quantiles:
ethik.query.Query.from_taus(
X_test=X_test[["age"]],
labels=[y_pred.name],
n_taus=7,
q=[0.05, 0.95],
)
It doesn't make sense to target a mean that is too close to the bounds of the feature, because it would lead to a distribution whose density is concentrated on the few smallest/largest individuals only, which is quite different from the original distribution the model was trained on.
To avoid that, we keep the targets between bounds that are controlled by the alpha parameter. By default, alpha = 0.05, which means that the targets are between the 5% and the 95% quantiles.
In the query above, tau == 0 represents the original mean, tau == -1 the 5% quantile and tau == 1 the 95% quantile. As the data may not be evenly distributed, a regular step for tau doesn't mean a regular step for target.
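A plausible reconstruction of how the taus could be mapped to targets, consistent with the description above (the actual mapping is implemented in ethik.query.Query.from_taus(), so treat this as a hypothesis, not as the library's code):
ages = X_test["age"]
mean, q_low, q_high = ages.mean(), ages.quantile(0.05), ages.quantile(0.95)

taus = np.linspace(-1, 1, 7)  # n_taus = 7
# tau == -1 maps to the 5% quantile, tau == 0 to the mean, tau == 1 to the 95%
# quantile; in between we interpolate linearly on each side of the mean
targets = [
    mean + tau * (q_high - mean) if tau >= 0 else mean + tau * (mean - q_low)
    for tau in taus
]
list(zip(taus, targets))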
The group id is built using a hash to easily retrieve an existing record (see below).
The query is built internally when we call explain_*():
cache_explainer = ethik.cache_explainer.CacheExplainer()
cache_explainer.explain_influence(
X_test=X_test["age"],
y_pred=y_pred,
).head()
Caching¶
CacheExplainer stores the results in its .info attribute:
cache_explainer.explain_influence(
X_test=X_test["age"],
y_pred=y_pred,
)
cache_explainer.info.head()
By default, this dataframe is reset at each call, as this is the least magical behaviour:
cache_explainer.explain_influence(
X_test=X_test["education-num"],
y_pred=y_pred,
)
cache_explainer.info.head()
But we can instantiate the explainer with memoize=True:
cache_explainer = ethik.cache_explainer.CacheExplainer(memoize=True)
cache_explainer.explain_influence(
X_test=X_test["age"],
y_pred=y_pred,
)
cache_explainer.info.head()
cache_explainer.explain_influence(
X_test=X_test["education-num"],
y_pred=y_pred,
)
cache_explainer.info
This way, we don't have to compute the ksis twice when we call explain_influence() and explain_performance():
cache_explainer.explain_performance(
X_test=X_test["age"],
y_test=y_test,
y_pred=y_pred > 0.5,
metric=metrics.accuracy_score,
)
cache_explainer.info
What characterizes $\xi$ is the list of (feature, target) pairs. We hash it to get the group id, so that we can rapidly identify the parts of the query that have already been computed during a previous call.
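The idea can be illustrated with a short, hypothetical sketch (this is not ethik's actual hashing code, just the principle):
import hashlib

def group_id(pairs):
    # Hash the sorted (feature, target) pairs into a stable key: two calls that
    # stress the same features towards the same targets share the same group id
    key = ",".join(f"{feature}={target}" for feature, target in sorted(pairs))
    return hashlib.sha256(key.encode()).hexdigest()[:16]

group_id([("age", 30)]), group_id([("age", 30), ("education-num", 10)])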