The induction head in a 2-layer attention-only transformer model has a slight bias towards tokens later in the context over earlier ones. Interestingly, its notion of position appears not to depend on the positional embeddings, or on the output of any specific attention head in the previous layer.

'Recency bias' in an induction head

In a 2-layer attention-only transformer model, an induction head can combine with an "averaging" head that stores some kind of average over the previous ~4-5 tokens to produce a circuit that can predict the next token in repeated sequences of length 2 to 5.

Induction head circuits for longer sequences

A few plots of previous token heads, a discussion of how they work, and a comparison with a similar type of attention head: the "look-back-two" head.

The previous token head and the "look-back-two" head

The positional embeddings in a 2-layer attention-only transformer model arrange themselves into a helical structure. This presumably allows the model to learn QK matrices that shift attention by a few positions in relative terms, using a similar transformation at every position. The positional embeddings at positions 0 and 1023 have special properties.
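To make the helix intuition concrete, here is a minimal sketch using an idealized 3-dimensional helix rather than the model's learned embeddings. The functions `helix_embedding` and `relative_shift` and the parameters `omega` and `pitch` are hypothetical, chosen only for illustration. On a perfect helix, one fixed rotation plus a constant offset maps the embedding at position t to the embedding at position t + k for every t, which is the kind of position-independent relative shift a QK matrix could exploit.

```python
import numpy as np

def helix_embedding(positions, omega=0.1, pitch=0.01):
    """Idealized helix: two circular coordinates plus one linear coordinate.

    omega and pitch are illustrative assumptions, not values from the model.
    """
    positions = np.asarray(positions, dtype=float)
    return np.stack(
        [np.cos(omega * positions), np.sin(omega * positions), pitch * positions],
        axis=-1,
    )

def relative_shift(k, omega=0.1, pitch=0.01):
    """Return (R, b) such that R @ p(t) + b == p(t + k) for every position t."""
    c, s = np.cos(omega * k), np.sin(omega * k)
    # Rotate the circular coordinates by omega * k; leave the linear one alone.
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Advance the linear coordinate by pitch * k.
    b = np.array([0.0, 0.0, pitch * k])
    return R, b

positions = np.arange(1024)
emb = helix_embedding(positions)

# One fixed map that looks back two positions, applied to all positions at once.
R, b = relative_shift(-2)
shifted = emb @ R.T + b

# shifted[t] should equal emb[t - 2] wherever t - 2 is a valid position.
max_error = np.abs(shifted[2:] - emb[:-2]).max()
print(f"max error of the look-back-two map: {max_error:.2e}")
```

The learned embeddings are only approximately helical and live in a much higher-dimensional space, so a real model would at best approximate this behavior. The sketch also says nothing about the special behavior at positions 0 and 1023.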

Posts tagged 2-layer-transformer

'Recency bias' in an induction head

Induction head circuits for longer sequences

The previous token head and the "look-back-two" head

Positional Embeddings in a 2-layer attention-only transformer model