In this tutorial we will see how to customize the target function used by Inseq to compute attributions, to enable some interesting use cases of feature attribution methods.

Note

The Inseq library comes with a set of pre-defined step score functions such as `probability` and `entropy`. By passing one or more score names when calling `model.attribute`, these scores are computed from model outputs and returned in the `step_scores` dictionary of the output objects. The list of all available scores can be retrieved with `inseq.list_step_functions`, and new scores can be added with `inseq.register_step_function`.
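As a concrete illustration, the `entropy` score corresponds to the Shannon entropy of the model's next-token distribution. A minimal plain-Python sketch of that computation (the helper name `shannon_entropy` is ours, not part of Inseq):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a probability distribution.

    Higher values indicate a more uncertain (flatter) next-token distribution.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution has low entropy, a uniform one has maximal entropy
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(shannon_entropy(peaked))   # ≈ 0.168 (low: the model is confident)
print(shannon_entropy(uniform))  # ≈ 1.386 = log(4) (maximal for 4 outcomes)
```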

Besides providing useful statistics about the model's predictive distribution, step score functions can also be used as targets when computing feature attributions. The default behavior of the library is to use next token probability (i.e. the `probability` step score) as the attribution target. This is a fairly standard practice, considering that most studies perform attributions using output logits as targets, and that the softmax transformation mapping logits to probabilities does not affect the attribution scores.
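For reference, a minimal plain-Python sketch of the softmax mapping from logits to probabilities mentioned above (illustrative only, not Inseq code):

```python
import math

def softmax(logits):
    """Map raw logits to a probability distribution.

    Subtracting the max first is a standard numerical-stability trick;
    it does not change the result because softmax is shift-invariant.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# The transformation is monotonic: the ranking of tokens is preserved
assert probs[0] > probs[1] > probs[2]
assert abs(sum(probs) - 1.0) < 1e-9
```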

Intuitively, scores produced by attributing the next token's probability answer the question “Which elements of the input sequence are the most relevant to produce the next generation step?”. High-magnitude scores (positive or negative, depending on the output range of the attribution method) for a generation step can then be interpreted as marking input values that heavily impact next-token production.

While interesting, this question is not the only one that could be answered by gradient-based methods. For example, we might be interested in knowing why our model generated its output sequence rather than another one that we consider more likely. The paper “Interpreting Language Models with Contrastive Explanations” by Yin and Neubig (2022) suggests that such a question can be answered by complementing the output probabilities with those of their contrastive counterparts, and using the difference between the two as the attribution target.
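To make the contrastive target concrete, here is a toy plain-Python computation of the probability difference between a selected token and a contrastive alternative at a single generation step (values and indices are hypothetical, not Inseq code):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary logits at one generation step
logits = [3.2, 2.9, 0.5, -1.0]
target_id, contrast_id = 0, 1  # selected token vs. contrastive alternative

probs = softmax(logits)
# Contrastive attribution target: p(target) - p(contrast)
diff = probs[target_id] - probs[contrast_id]
# A small positive value means the model only slightly prefers the target,
# and attributing this difference highlights the inputs driving that preference
```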

We can define such an attribution function using the standard template adopted by Inseq.

```python
from inseq.attr.step_functions import probability_fn

# Simplified implementation of inseq.attr.step_functions.contrast_prob_diff_fn
# Works only for encoder-decoder models!
def example_prob_diff_fn(
    # Default arguments for Inseq step functions
    attribution_model,
    forward_output,
    encoder_input_embeds,
    encoder_attention_mask,
    decoder_input_ids,
    target_ids,
    # Extra arguments for our use case
    contrast_ids,
    contrast_attention_mask,
    # We use kwargs to collect unused default arguments
    **kwargs,
):
    """Custom attribution function returning the difference between next step probability for
    candidate generation vs. a contrastive alternative, answering the question "Which features
    were salient in deciding to pick the selected token rather than its contrastive alternative?"

    Extra args:
        contrast_ids: Tensor containing the ids of the contrastive input to be compared to the
            regular one.
        contrast_attention_mask: Tensor containing the attention mask of the contrastive input.
    """
    # We truncate contrastive ids and their attention map to the current generation step
    prefix_len = decoder_input_ids.shape[1]
    contrast_decoder_input_ids = contrast_ids[:, :prefix_len].to(attribution_model.device)
    contrast_decoder_attention_mask = contrast_attention_mask[:, :prefix_len].to(attribution_model.device)
    # We select the next contrastive token as target
    contrast_target_ids = contrast_ids[:, prefix_len].to(attribution_model.device)
    # Forward pass with the same model used for the main generation, but using contrastive inputs instead
    contrast_output = attribution_model.model(
        inputs_embeds=encoder_input_embeds,
        attention_mask=encoder_attention_mask,
        decoder_input_ids=contrast_decoder_input_ids,
        decoder_attention_mask=contrast_decoder_attention_mask,
    )
    # Return the prob difference as target for attribution
    model_probs = probability_fn(attribution_model, forward_output, target_ids)
    contrast_probs = probability_fn(attribution_model, contrast_output, contrast_target_ids)
    return model_probs - contrast_probs
```

Besides common arguments such as the attribution model, its outputs after the forward pass, and all the input ids and attention masks required by 🤗 Transformers, we pass the contrastive ids and their attention mask as inputs to compute the difference between original and contrastive probabilities. The output of the function is what is used to compute the gradients with respect to the input.
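The `**kwargs` catch-all in the signature above is what lets the library call every registered step function with the full set of default arguments, while each function only names the ones it needs. A minimal plain-Python sketch of this dispatch pattern (function and argument names here are ours, not Inseq's):

```python
def entropy_like_fn(forward_output, **kwargs):
    # Only uses forward_output; all other arguments are silently collected
    return f"entropy({forward_output})"

def contrast_like_fn(forward_output, target_ids, contrast_ids, **kwargs):
    # Uses some defaults plus a custom extra argument
    return f"diff({forward_output}, {target_ids}, {contrast_ids})"

# The caller always passes the same superset of keyword arguments,
# and each function picks out what it declares
common_args = {"forward_output": "out", "target_ids": [5], "decoder_input_ids": [1, 2]}
extra_args = {"contrast_ids": [6]}

print(entropy_like_fn(**common_args, **extra_args))
print(contrast_like_fn(**common_args, **extra_args))
```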

Now that we have our custom attribution function, integrating it in Inseq is very easy:

```python
import inseq

# Register the function defined above
# Since outputs are still probabilities, contiguous tokens can still be aggregated using product
inseq.register_step_function(
    fn=example_prob_diff_fn,
    identifier="example_prob_diff",
    aggregate_map={"span_aggregate": lambda x: x.prod(dim=1, keepdim=True)},
)

# Load an English-to-Italian translation model with the saliency attribution method
attribution_model = inseq.load_model("Helsinki-NLP/opus-mt-en-it", "saliency")

# Pre-compute ids and attention map for the contrastive target
contrast = attribution_model.encode("Ho salutato la manager", as_targets=True)

# Regular (forced) target -> "Ho salutato il manager"
# Contrastive target      -> "Ho salutato la manager"
# contrast_ids & contrast_attention_mask are kwargs defined in the function definition
out = attribution_model.attribute(
    "I said hi to the manager",
    "Ho salutato il manager",
    attributed_fn="example_prob_diff",
    contrast_ids=contrast.input_ids,
    contrast_attention_mask=contrast.attention_mask,
    attribute_target=True,
    # We also visualize the step score
    step_scores=["example_prob_diff"],
)
```

Tip

The contrastive attribution function showcased above is already registered in Inseq under the name `contrast_prob_diff`, so give it a try!

The `aggregate_map` argument informs the library about which functions should be used when aggregating step scores (not attributions!) with `Aggregator` classes. In this example, we specify that when aggregating over multiple tokens with the `ContiguousSpanAggregator`, we can simply take the product of the computed probability differences as their aggregated score.
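The product is a natural span-level aggregate for probability-like scores: by the chain rule of probability, the probability of generating a contiguous multi-token span is the product of the step-wise token probabilities. A plain-Python sketch of this aggregation (illustrative only; the actual `aggregate_map` lambda operates on tensors):

```python
import math

def span_aggregate_product(step_scores):
    """Aggregate per-token step scores over a contiguous span by product,
    mirroring the intent of the prod-based lambda passed to aggregate_map."""
    return math.prod(step_scores)

# Per-token probabilities for a three-token span
token_probs = [0.9, 0.8, 0.95]
span_prob = span_aggregate_product(token_probs)
# By the chain rule, this is the probability of generating the whole span
assert abs(span_prob - 0.684) < 1e-9
```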