<aside> ℹ️
I've drafted a quick, simple blog post on transformers. This post is part 1 of a series I'm putting together. It will give you a rundown of what transformers are and explain input handling—specifically, tokenization and positional encoding.
Please note that I wrote this quickly, so it may contain errors and typos. Feel free to drop me a line if you find any. Thanks!
</aside>
A transformer is a neural network architecture for sequence-to-sequence and sequence-to-label tasks (e.g., translation, summarization, classification) that replaces recurrence and convolution with an attention mechanism.
While a typical RNN processes tokens strictly from left to right, carrying information in a hidden state, a transformer lets each token directly look at (i.e., attend to) every other token and build a context-aware representation.
A transformer architecture consists of:

- an encoder stack
- a decoder stack
Each stack has two big sub-blocks:

- a self-attention layer
- an MLP (feed-forward network)
Self-attention lets every token interact with every other token (n-to-n), which is what provides contextualization. The MLP then refines that information for each token independently.
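To make that n-to-n interaction concrete, here is a minimal single-head self-attention sketch in PyTorch. The 64-dimensional embeddings, the absence of masking, and the single head are simplifications chosen for the example, not the setup of any particular model.

```python
import torch
import torch.nn.functional as F
from torch import nn

class SelfAttention(nn.Module):
    """Minimal single-head self-attention: no masking, no multi-head split."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Every token scores every other token: (batch, seq_len, seq_len)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)
        # Each output row is a weighted mix of all value vectors -> contextualized
        return weights @ v

x = torch.randn(1, 5, 64)                        # a batch of 5 tokens, 64-dim each
print(SelfAttention(64)(x).shape)                # torch.Size([1, 5, 64])
```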
Around these two sub-blocks, two more components matter: residual connections (skip connections) and layer normalization.
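Putting the pieces together, a single block might look like the sketch below. It reuses the SelfAttention module and the tensor x from the previous snippet, and it uses the pre-norm arrangement (LayerNorm applied before each sub-block); the original paper used post-norm, and both are common.

```python
class TransformerBlock(nn.Module):
    """One block: self-attention + MLP, each wrapped in a residual and LayerNorm."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = SelfAttention(d_model)       # from the sketch above
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # residual around attention (tokens interact)
        x = x + self.mlp(self.norm2(x))    # residual around the MLP (per-token refinement)
        return x

print(TransformerBlock(d_model=64, d_hidden=256)(x).shape)   # still (1, 5, 64)
```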
A transformer never deals with characters directly; it only sees token IDs. The first step is converting text into tokens (words, subwords, or bytes), each of which is mapped to an integer ID.
<aside>
"unbelievable" → ["un", "believ", "able"] → [417, 9821, 211]
</aside>
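As a toy illustration of the subword idea, here is a greedy longest-match tokenizer over a tiny hand-written vocabulary. The pieces and IDs are made up for this example; real tokenizers (BPE, WordPiece, etc.) learn their vocabularies from data and use more sophisticated merge rules.

```python
# Toy vocabulary: subword piece -> token ID (IDs invented for illustration)
VOCAB = {"un": 417, "believ": 9821, "able": 211}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest known piece at each position, then map to IDs."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids

print(tokenize("unbelievable"))   # [417, 9821, 211]
```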
Each token ID from the previous step (tokenization) is mapped to a dense vector using an embedding matrix.
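Concretely, this step is just a table lookup: an embedding matrix of shape (vocab_size, d_model) indexed by token ID. The sizes below are arbitrary placeholders for illustration.

```python
import torch
from torch import nn

vocab_size, d_model = 50_000, 64                # placeholder sizes
embedding = nn.Embedding(vocab_size, d_model)   # the learnable embedding matrix

token_ids = torch.tensor([[417, 9821, 211]])    # IDs from the tokenization example
vectors = embedding(token_ids)                  # one dense vector per token
print(vectors.shape)                            # torch.Size([1, 3, 64])
```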