christophm.github.io/interpretable-ml-book/shap.html


Top Highlights

The basic idea is to push all possible subsets S down the tree at the same time.
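To make this concrete, here is a hedged sketch (the `Node` fields and function names are illustrative, not the shap library's API) of the naive per-subset computation that TreeSHAP batches over all subsets at once: for one subset S, follow x's path when the split feature is in S, and otherwise average both branches weighted by how many training samples went each way.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # split feature index; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    n_samples: int = 0                 # training samples that reached this node
    value: float = 0.0                 # leaf prediction

def expected_value(node, x, S):
    """E[f(x) | x_S] for one subset S: follow x's path when the split
    feature is in S, otherwise average both branches by coverage."""
    if node.feature is None:
        return node.value
    if node.feature in S:
        child = node.left if x[node.feature] <= node.threshold else node.right
        return expected_value(child, x, S)
    w_left = node.left.n_samples / node.n_samples
    return (w_left * expected_value(node.left, x, S)
            + (1.0 - w_left) * expected_value(node.right, x, S))
```

TreeSHAP's contribution is avoiding the exponential loop over subsets by tracking all of them during a single traversal of the tree.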

Features are often on different scales.

All SHAP values have the same unit – the unit of the prediction space.

Small coalitions (few 1’s) and large coalitions (many 1’s) get the largest weights. The intuition behind this is that we learn most about individual features when we can study their effects in isolation. If a coalition consists of a single feature, we can learn about that feature’s isolated main effect on the prediction.

SHAP kernel

Lundberg and Lee show that linear regression with this kernel weight yields Shapley values.
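The kernel weight for a coalition of size k out of M features is (M − 1) / (C(M, k) · k · (M − k)), which is what makes small and large coalitions count most. A minimal sketch (function name is ours; the formula is from Lundberg and Lee's paper):

```python
from math import comb

def shap_kernel_weight(M, k):
    """SHAP kernel weight pi(z') = (M - 1) / (C(M, k) * k * (M - k))
    for a coalition of size k out of M features."""
    if k in (0, M):
        # Empty and full coalitions get infinite weight; in practice they
        # act as constraints fixing the intercept and the full prediction.
        return float("inf")
    return (M - 1) / (comb(M, k) * k * (M - k))

# Small and large coalitions get the largest finite weights:
[shap_kernel_weight(10, k) for k in (1, 5, 9)]
```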

For x, the instance of interest, the coalition vector x’ is a vector of all 1’s, i.e. all feature values are “present”.
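A tiny pure-Python illustration (names are ours) of how a coalition vector maps back to feature space: a 1 means the feature takes the value of the instance x being explained, a 0 means the value is filled in from a background sample instead.

```python
def map_coalition(z, x, background):
    """Map a coalition vector z' to a feature-space instance:
    1 -> value from x, 0 -> value from the background sample."""
    return [xv if zv == 1 else bv for zv, xv, bv in zip(z, x, background)]

x = [5.0, 3.0, 1.0]     # instance of interest
bg = [0.0, 0.0, 0.0]    # one background/reference sample
map_coalition([1, 1, 1], x, bg)  # all 1's reproduces x itself
map_coalition([1, 0, 1], x, bg)  # feature 2 is "absent"
```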

The problem is that we have to apply this procedure for each possible subset S of the feature values.

With SHAP, global interpretations are consistent with the local explanations, since the Shapley values are the “atomic unit” of the global interpretations.

Features with large absolute Shapley values are important
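A minimal sketch (illustrative names and made-up numbers) of the usual aggregation: global importance as the mean absolute Shapley value per feature, over a matrix of per-instance SHAP values with rows as instances and columns as features.

```python
def global_importance(shap_values):
    """Mean absolute Shapley value per feature (column)."""
    n_rows = len(shap_values)
    n_cols = len(shap_values[0])
    return [sum(abs(row[j]) for row in shap_values) / n_rows
            for j in range(n_cols)]

shap_values = [[ 0.5, -0.1],
               [-0.3,  0.2]]
global_importance(shap_values)  # feature 0 comes out as more important
```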

If you use LIME for local explanations and partial dependence plots plus permutation feature importance for global explanations, you lack a common foundation.

One innovation that SHAP brings to the table is that the Shapley value explanation is represented as an additive feature attribution method, i.e. a linear model.
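That linear model is g(z') = φ₀ + Σⱼ φⱼ z'ⱼ over coalition vectors. A small sketch with made-up numbers: for the full coalition (all 1's), the explanation sums the base value and all Shapley values, recovering the model's prediction for x.

```python
def g(phi0, phi, z):
    """Additive explanation model g(z') = phi_0 + sum_j phi_j * z'_j."""
    return phi0 + sum(p * zj for p, zj in zip(phi, z))

phi0 = 0.5               # base value (average prediction); illustrative
phi = [0.2, -0.1, 0.4]   # Shapley values for one instance; illustrative
g(phi0, phi, [1, 1, 1])  # full coalition: base value plus all Shapley values
```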

Sampling from the marginal distribution means ignoring the dependence structure between present and absent features

KernelSHAP therefore suffers from the same problem as all permutation-based interpretation methods: the estimation puts too much weight on unlikely instances, and results can become unreliable. Yet sampling from the marginal distribution is necessary here, because sampling from the conditional distribution instead would change the value function, and therefore the game to which Shapley values are the solution.

For example, we can add regularization terms to make the model sparse. If we add an L1 penalty to the loss L, we can create sparse explanations. (I am not so sure whether the resulting coefficients would still be valid Shapley values though.)
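A minimal sketch of why the L1 penalty sparsifies: under an orthonormal design, L1-penalized regression reduces to soft-thresholding the unpenalized coefficients, which zeroes out small attributions. (Names and numbers are ours; as the text cautions, the shrunken values are no longer exact Shapley values.)

```python
def soft_threshold(coefs, lam):
    """Shrink each coefficient toward zero by lam; zero out the small ones."""
    return [max(abs(c) - lam, 0.0) * (1 if c > 0 else -1) for c in coefs]

phi = [0.40, -0.05, 0.02, -0.30]          # made-up attribution estimates
sparse_phi = soft_threshold(phi, lam=0.1)  # middle two become exactly zero
```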

The non-zero estimate can happen when the feature is correlated with another feature that actually has an influence on the prediction.

The interaction effect is the additional combined feature effect after accounting for the individual feature effects. The Shapley interaction index from game theory is defined as
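The highlight cuts off before the formula itself; as a reconstruction (following Lundberg et al., not a verbatim quote), for features $i \neq j$ it can be written as

$$\phi_{i,j} = \sum_{S \subseteq \{1,\dots,M\} \setminus \{i,j\}} \frac{|S|!\,(M-|S|-2)!}{2(M-1)!}\,\delta_{ij}(S)$$

with

$$\delta_{ij}(S) = \hat{f}_x(S \cup \{i,j\}) - \hat{f}_x(S \cup \{i\}) - \hat{f}_x(S \cup \{j\}) + \hat{f}_x(S).$$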

This formula subtracts the main effect of the features so that we get the pure interaction effect after accounting for the individual effects. We average the values over all possible feature coalitions S, as in the Shapley value computation. When we compute SHAP interaction values for all features, we get one matrix per instance with dimensions M x M, where M is the number of features.

The SHAP authors proposed KernelSHAP, an alternative, kernel-based estimation approach for Shapley values inspired by local surrogate models.

TreeSHAP
