#

Attention with Linear Biases (ALiBi)

This is an implementation of Attention with Linear Biases (ALiBi) from the paper "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation".

This replaces positional encodings with biases added to the attention scores (attention logits, before the softmax). It is a relative scheme tested on autoregressive tasks: the bias is higher for nearby tokens and lower for distant tokens. The biases decrease linearly in log scale (because they are added before the softmax), and each head has a different slope.

Here's the attention formula for the $i$-th token:

$$a_{i} = \mathrm{softmax}\left(q_{i} K^{\top} + m \cdot [-(i - 1), \dots, -1, 0]\right) = \mathrm{softmax}\left(q_{i} K^{\top} + m \cdot [0, 1, \dots, (i - 1)]\right)$$

where $q_{i} \in \mathbb{R}^{d}$ is the query of the $i$-th token, $K \in \mathbb{R}^{i \times d}$ are the keys up to $i$, and $d$ is the number of features per head. Note that the above equality holds because $\mathrm{softmax}$ is invariant to translations (you can add any constant to all elements without changing the result); here the constant is $m (i - 1)$.
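The translation invariance can be checked numerically. The following is a minimal sketch in plain Python; the score values, the slope `m`, and the helper `softmax` are made up for illustration and are not part of the implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores q_i K^T for the 4th token, and one head's slope m
scores = [0.3, -1.2, 0.8, 0.1]
m = 0.5
i = len(scores)

# Bias written as m * [-(i - 1), ..., -1, 0]
bias_negative = [m * -(i - 1 - j) for j in range(i)]
# Bias written as m * [0, 1, ..., i - 1]; each entry is larger by the constant m * (i - 1)
bias_shifted = [m * j for j in range(i)]

attn_negative = softmax([s + b for s, b in zip(scores, bias_negative)])
attn_shifted = softmax([s + b for s, b in zip(scores, bias_shifted)])

# The two attention distributions agree up to floating-point error
print(max(abs(a - b) for a, b in zip(attn_negative, attn_shifted)))
```

Both forms give the most recent key (the last entry) the largest bias, so either can be used in code; the non-negative form is simply easier to build with a plain range.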

Here is the training code for an ALiBi model.

import math

import torch
from torch import nn

from labml.logger import inspect
from labml_nn.transformers.mha import MultiHeadAttention

#

Get head-specific slope $m$ for each head

n_heads is the number of heads in the attention layer $n$

The slope for the first head is

$$2^{-2^{-(\log_{2} n - 3)}}$$

The slopes for the remaining heads form a geometric sequence whose ratio equals the first slope.

For instance, when the number of heads is $8$, the slopes are $\frac{1}{2^{1}}, \frac{1}{2^{2}}, \dots, \frac{1}{2^{8}}$

def get_slopes(n_heads: int):

#

$2^{-2^{-(\log_{2} n - 3)}}$

    s = 2 ** (-2 ** -(math.log2(n_heads) - 3))

#

The geometric sequence

    return [s * (s ** i) for i in range(n_heads)]
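As a quick check of the slope formula, the sketch below repeats `get_slopes` so that it runs on its own and evaluates it for 8 and 16 heads. The variable names `slopes_8` and `slopes_16` are illustrative only.

```python
import math


def get_slopes(n_heads: int):
    # First slope 2^(-2^(-(log2(n) - 3))); it is also the ratio of the sequence
    s = 2 ** (-2 ** -(math.log2(n_heads) - 3))
    return [s * (s ** i) for i in range(n_heads)]


slopes_8 = get_slopes(8)
print(slopes_8)
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]

# With 16 heads the first slope (and ratio) is 2^(-1/2), and the last slope is again about 1/2^8
slopes_16 = get_slopes(16)
print(slopes_16[0], slopes_16[-1])
```

In both cases the slopes span roughly the same range, from the first slope down to $\frac{1}{2^{8}}$; more heads only sample that range more densely.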

#

Calculate the attention biases matrix

n_heads is the number of heads in the attention layer
max_len is the maximum sequence length

This returns a matrix of shape [max_len, n_heads] with attention biases.

def get_biases(n_heads: int, max_len: int):

#

Get slopes $m$ for each head

    slopes = torch.tensor(get_slopes(n_heads))

#

Calculate distances $[0, 1, \dots, N - 1]$, where $N$ is max_len

    distance = torch.arange(max_len).to(torch.float)

#

Multiply them pair-wise to get the bias matrix

    return distance[:, None] * slopes[None, :]
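To illustrate how the result might be used, the sketch below repeats both functions so that it runs on its own, then adds the biases to the scores of a single query before the softmax. The random `scores` tensor and its `[max_len, n_heads]` layout are assumptions made for this example; it does not show how the attention layer itself consumes the biases.

```python
import math

import torch


def get_slopes(n_heads: int):
    s = 2 ** (-2 ** -(math.log2(n_heads) - 3))
    return [s * (s ** i) for i in range(n_heads)]


def get_biases(n_heads: int, max_len: int):
    slopes = torch.tensor(get_slopes(n_heads))
    distance = torch.arange(max_len).to(torch.float)
    # Broadcasting [max_len, 1] * [1, n_heads] gives one row per position, one column per head
    return distance[:, None] * slopes[None, :]


n_heads, max_len = 8, 5
biases = get_biases(n_heads, max_len)
print(biases.shape)   # torch.Size([5, 8])
print(biases[:, 0])   # first head, slope 1/2: tensor([0.0000, 0.5000, 1.0000, 1.5000, 2.0000])

# Hypothetical scores q_i K^T of the last token against all max_len keys, one column per head
torch.manual_seed(0)
scores = torch.randn(max_len, n_heads)

# Add the per-head bias before the softmax over the key dimension
attn = torch.softmax(scores + biases, dim=0)
```

Because the bias grows with the key index, the most recent keys receive the largest bias, which matches the $m \cdot [0, 1, \dots, (i - 1)]$ form of the attention formula.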

#