Custom Attribution Targets for Contrastive Attribution

In this tutorial, we will see how to customize the target function used by Inseq to compute attributions, enabling some interesting use cases for feature attribution methods.

Note

The Inseq library comes with a set of pre-defined step score functions such as probability and entropy. By passing one or more score names to model.attribute, these scores are computed from model outputs and returned in the step_scores dictionary of the output object. The full list of available scores can be retrieved with inseq.list_step_functions, and new scores can be added with inseq.register_step_function.
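As a quick illustration, here is a minimal sketch of listing available step functions and requesting step scores during attribution (the input sentence is just an example):

import inseq

# Print the identifiers of all registered step functions
print(inseq.list_step_functions())

# Load a model and request step scores alongside the attribution output
model = inseq.load_model("Helsinki-NLP/opus-mt-en-it", "saliency")
out = model.attribute("Hello everyone!", step_scores=["probability", "entropy"])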

Besides providing useful statistics about the model's predictive distribution, step score functions are also used as targets when computing feature attributions. The default behavior of the library is to use the next token's probability (i.e. the probability step score) as the attribution target. This is fairly standard practice, considering that most studies perform attribution using output logits as targets, and that the softmax transformation mapping logits to probabilities does not meaningfully affect the resulting attribution scores.
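In practice, this means the two calls below are expected to produce equivalent attributions (a minimal sketch reusing the model loaded in the snippet above):

# Attributing with no explicit target is equivalent to selecting the
# probability step function as the attribution target
out_default = model.attribute("I said hi to the manager")
out_explicit = model.attribute("I said hi to the manager", attributed_fn="probability")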

Intuitively, scores produced by attributing the next token's probability answer the question: "Which elements of the input sequence are most relevant for producing the next generation step?". Highly positive or negative scores (depending on the output range of the attribution method) for a generation step can then be interpreted as marking input elements that heavily impact the production of the next token.

While interesting, this question is not the only one that could be answered by gradient-based methods. For example, we might be interested in knowing why our model generated its output sequence rather than another one that we consider to be more likely. The paper "Interpreting Language Models with Contrastive Explanations" by Yin and Neubig (2022) suggests that this question can be answered by complementing the output probabilities with those of their contrastive counterparts, and using the difference between the two as the attribution target.
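Concretely, for a generation step t with source x, forced target token y_t and contrastive alternative ỹ_t, the attribution target becomes the difference p(y_t | y_<t, x) − p(ỹ_t | ỹ_<t, x): a large positive value indicates that the model strongly prefers the original token over its contrastive alternative at that step.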

We can define such an attribution function using the standard template adopted by Inseq. The StepFunctionDecoderOnlyArgs and StepFunctionEncoderDecoderArgs classes are used for convenience to encapsulate all default arguments passed to step functions, namely:

  • attribution_model: the attribution model used to compute attributions.

  • forward_output: the output of the forward pass of the attribution model.

  • target_ids: the ids corresponding to the next predicted tokens for the current generation step.

  • ids, embeddings and attention masks corresponding to the model inputs at the present step, including encoder inputs in the case of encoder-decoder models.

from inseq.attr.step_functions import probability_fn, StepFunctionEncoderDecoderArgs

# Simplified implementation of inseq.attr.step_functions.contrast_prob_diff_fn
# Works only for encoder-decoder models!
def example_prob_diff_fn(
    # Default arguments for all step functions
    args: StepFunctionEncoderDecoderArgs,
    # Extra arguments for our use case
    contrast_ids,
    contrast_attention_mask,
):
    """Custom attribution function returning the difference between next step probability for
    candidate generation vs. a contrastive alternative, answering the question "Which features
    were salient in deciding to pick the selected token rather than its contrastive alternative?"

    Extra args:
        contrast_ids: Tensor containing the ids of the contrastive input to be compared to the
            regular one.
        contrast_attention_mask: Tensor containing the attention mask for the contrastive input
    """
    # We truncate contrastive ids and their attention mask to the current generation step
    device = args.attribution_model.device
    len_inputs = args.decoder_input_ids.shape[1]
    contrast_decoder_input_ids = contrast_ids[:, : len_inputs].to(device)
    contrast_decoder_attention_mask = contrast_attention_mask[:, : len_inputs].to(device)
    # We select the next contrastive token as target
    contrast_target_ids = contrast_ids[:, len_inputs].to(device)
    # Forward pass with the same model used for the main generation, but using contrastive inputs instead
    contrast_output = args.attribution_model.model(
        inputs_embeds=args.encoder_input_embeds,
        attention_mask=args.encoder_attention_mask,
        decoder_input_ids=contrast_decoder_input_ids,
        decoder_attention_mask=contrast_decoder_attention_mask,
    )
    # Compute the probability difference used as the attribution target
    model_probs = probability_fn(args)
    args.forward_output = contrast_output
    args.target_ids = contrast_target_ids
    contrast_probs = probability_fn(args)
    return model_probs - contrast_probs

Besides common arguments such as the attribution model, its outputs after the forward pass, and all the input ids and attention masks required by 🤗 Transformers, we pass the contrastive ids and their attention mask as extra inputs to compute the difference between the original and contrastive probabilities. The output of the function is the quantity for which gradients with respect to the input are computed.

Now that we have our custom attribution function, integrating it in Inseq is very easy:

import inseq

# Register the function defined above
# Since outputs are still probabilities, contiguous tokens can still be aggregated using product
inseq.register_step_function(
    fn=example_prob_diff_fn,
    identifier="example_prob_diff",
    aggregate_map={"span_aggregate": lambda x: x.prod(dim=1, keepdim=True)},
)

attribution_model = inseq.load_model("Helsinki-NLP/opus-mt-en-it", "saliency")

# Pre-compute ids and attention mask for the contrastive target
contrast = attribution_model.encode("Ho salutato la manager", as_targets=True)

# Perform the contrastive attribution:
# Regular (forced) target -> "Ho salutato il manager"
# Contrastive target      -> "Ho salutato la manager"
# contrast_ids & contrast_attention_mask are kwargs defined in the function definition
out = attribution_model.attribute(
    "I said hi to the manager",
    "Ho salutato il manager",
    attributed_fn="example_prob_diff",
    contrast_ids=contrast.input_ids,
    contrast_attention_mask=contrast.attention_mask,
    attribute_target=True,
    # We also visualize the step score
    step_scores=["example_prob_diff"]
)

# Weight attribution scores by the probability difference
out.weight_attributions("example_prob_diff")
out.show()

From this example, we see that the masculine Italian determiner "il" is 70% more likely than its feminine counterpart "la" before "manager", and that the model is mostly influenced by the word "manager" itself. A textbook example of gender bias in machine translation! We can also see that the divergence between the two generations has almost no impact on the following tokens once we weight them by the probability difference.

The contrastive attribution function showcased above is already registered in Inseq under the name contrast_prob_diff. Give it a try!
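For instance, the attribution above can be reproduced without any registration step (the kwarg names follow the custom function defined earlier and may differ slightly across Inseq versions):

# Same contrastive attribution, using the built-in step function
out = attribution_model.attribute(
    "I said hi to the manager",
    "Ho salutato il manager",
    attributed_fn="contrast_prob_diff",
    contrast_ids=contrast.input_ids,
    contrast_attention_mask=contrast.attention_mask,
)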

Note

The aggregate_map argument informs the library about which functions should be used when aggregating step scores (not attributions!) with Aggregator classes. In this example, we specify that when aggregating over multiple tokens with the ContiguousSpanAggregator, we can simply take the product of the computed probability differences as their aggregated score.
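As a standalone illustration of what the registered span_aggregate function computes, here is a minimal sketch in plain PyTorch (not the Inseq internals):

import torch

# Step scores for three contiguous tokens forming a single span
span_scores = torch.tensor([[0.7, 0.9, 0.8]])

# The lambda registered in aggregate_map collapses the span into one score
aggregated = span_scores.prod(dim=1, keepdim=True)
print(aggregated)  # tensor([[0.5040]])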