Attributing Distributed LLMs with Petals
What is Petals?
Petals is a framework that enables the use of large language models without the need for high-end GPUs, exploiting the potential of distributed training and inference. With Petals, you can join compute resources with other people over the Internet and run large language models such as LLaMA, Guanaco, or BLOOM right from your desktop computer or Google Colab. See the official tutorial and the paper showcasing Petals for more details.
Since Petals allows gradient computations to take place on multiple machines and is mostly compatible with the Hugging Face Transformers library, it can be used alongside inseq to attribute large LLMs such as LLaMA 65B or BLOOM 175B. This tutorial shows how to load an LLM from Petals and use it to attribute a generated sequence.
Attributing LLMs with Petals
First, we need to install petals and inseq with pip install inseq petals. Then, we can load an LLM from Petals and attribute it with inseq. For this tutorial, we will use the LLaMA 65B model, which can be loaded as follows:
from petals import AutoDistributedModelForCausalLM

model_name = "enoch/llama-65b-hf"
# Only a small client-side portion of the model is loaded locally;
# the transformer blocks are served by remote peers in the swarm
model = AutoDistributedModelForCausalLM.from_pretrained(model_name).cuda()
We can now test a prompt of interest to see whether the model would provide the correct response:
from transformers import AutoTokenizer

prompt = (
    "Option 1: Take a 50 minute bus, then a half hour train, and finally a 10 minute bike ride.\n"
    "Option 2: Take a 10 minute bus, then an hour train, and finally a 30 minute bike ride.\n"
    "Which of the options above is faster to get to work?\n"
    "Answer: Option "
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, add_bos_token=False)
inputs = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
# Only 1 token is generated
outputs = model.generate(inputs, max_new_tokens=1)
print(tokenizer.decode(outputs[0]))
#>>> [...] The answer is Option 1
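As a quick sanity check on the expected answer, we can total the travel times described in the prompt ourselves:

# Total travel time for each option, in minutes
option_1 = 50 + 30 + 10  # bus + train + bike
option_2 = 10 + 60 + 30  # bus + train + bike
print(option_1, option_2)
#>>> 90 100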
We can see that the model correctly predicts Option 1 to be the faster option. Now, we can use inseq to attribute the model's prediction and understand which features played a relevant role in determining the model's answer. Exploiting the advanced features of the inseq library, we can easily perform a contrastive attribution using contrast_prob_diff_fn() with 1 and 2 as the factual and contrastive targets for gradient attribution (see our tutorial for more details).
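Before attributing, the Petals model must be wrapped in an inseq attribution model. The snippet below is a minimal sketch of this step, assuming the input_x_gradient attribution method reported in the log further down:

import inseq

# Wrap the distributed model with inseq, reusing the tokenizer by name
inseq_model = inseq.load_model(model, "input_x_gradient", tokenizer=model_name)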
out = inseq_model.attribute(
    prompt,
    prompt + "1",
    attributed_fn="contrast_prob_diff",
    contrast_targets=prompt + "2",
    step_scores=["contrast_prob_diff", "probability"],
)
# Attributing with input_x_gradient...: 100%|██████████| 80/80 [00:37<00:00, 37.55s/it]
Our attribution results are now stored in the out variable, and have exactly the same format as those obtained with any other Hugging Face decoder-only model. We can now visualize them using the show() method, specifying the aggregation of our choice. Here we will use the sum of input_x_gradient scores across all 8192 dimensions of the model's input embeddings, without normalizing them to sum to 1:
out.show(aggregator="sum", normalize=False)
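Since attribution over the public swarm can take a long time, it may be convenient to persist results to disk and reload them in a later session. A minimal sketch, assuming inseq's save/load helpers and an arbitrary file name:

# Save attribution results to disk and reload them later
out.save("llama65b_attributions.json")
out = inseq.FeatureAttributionOutput.load("llama65b_attributions.json")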
From the results we can observe that the model is generally upweighting the minutes tokens, while attribution scores for hour are less clear-cut. We can also observe that the model predicts Option 1 with a probability of ~53% (probability), which is roughly 8% higher than that of the contrastive option 2 (contrast_prob_diff). In light of this, we could hypothesize that the attributions are not very informative because of the model's relatively low confidence in its prediction.
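The step scores mentioned above can also be read programmatically from the attribution output. A minimal sketch, assuming the step_scores dictionary exposed by each sequence attribution in recent inseq versions:

# Step scores for the first (and only) attributed sequence
scores = out.sequence_attributions[0].step_scores
prob = scores["probability"][0].item()
diff = scores["contrast_prob_diff"][0].item()
print(f"P(Option 1): {prob:.1%}, margin over Option 2: {diff:.1%}")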
Warning
While most methods relying only on model predictions should work normally with Petals, methods requiring access to model internals, such as attention, are not currently supported.