<aside> ℹ️
I've drafted a quick, simple blog post on transformers. This post is part 1 of a series I'm putting together. It will give you a rundown of what transformers are and explain input handling—specifically, tokenization and positional encoding.
Please note that I wrote this quickly, so it may contain errors and typos. Feel free to drop me a line if you find any. Thanks!
</aside>
A transformer is a neural network architecture for sequence-to-sequence and sequence-to-label tasks (e.g., translation, summarization, classification) that replaces recurrence and convolution with an attention mechanism.
While a typical RNN processes tokens strictly from left to right, carrying information in a hidden state, a transformer lets each token directly look at (i.e., attend to) every other token and build a context-aware representation.
A transformer architecture consists of:

- an encoder stack
- a decoder stack
Each stack has two big sub-blocks:

- a self-attention layer
- an MLP (feed-forward network)
Self-attention lets every token interact with every other token (n-to-n), which is what provides contextualization. The MLP then refines that information for each token independently.
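To make that n-to-n interaction concrete, here is a minimal single-head self-attention sketch in PyTorch. The 64-dimensional embeddings, the absence of masking, and the single head are simplifications chosen for the example, not the setup of any particular model.

```python
import torch
import torch.nn.functional as F
from torch import nn

class SelfAttention(nn.Module):
    """Minimal single-head self-attention: no masking, no multi-head split."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Every token scores every other token: (batch, seq_len, seq_len)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)
        # Each output row is a weighted mix of all value vectors -> contextualized
        return weights @ v

x = torch.randn(1, 5, 64)                        # a batch of 5 tokens, 64-dim each
print(SelfAttention(64)(x).shape)                # torch.Size([1, 5, 64])
```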
Around these two sub-blocks, two more components matter: residual connections (skip connections) and layer normalization.
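Putting the pieces together, a single block might look like the sketch below. It reuses the SelfAttention module and the tensor x from the previous snippet, and it uses the pre-norm arrangement (LayerNorm applied before each sub-block); the original paper used post-norm, and both are common.

```python
class TransformerBlock(nn.Module):
    """One block: self-attention + MLP, each wrapped in a residual and LayerNorm."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = SelfAttention(d_model)       # from the sketch above
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # residual around attention (tokens interact)
        x = x + self.mlp(self.norm2(x))    # residual around the MLP (per-token refinement)
        return x

print(TransformerBlock(d_model=64, d_hidden=256)(x).shape)   # still (1, 5, 64)
```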
A transformer never deals with characters directly; it only sees token IDs. The first step is converting text into tokens (words, subwords, or bytes), each of which is mapped to an integer ID.
<aside>
"unbelievable" → ["un", "believ", "able"] → [417, 9821, 211]
</aside>
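As a toy illustration of the subword idea, here is a greedy longest-match tokenizer over a tiny hand-written vocabulary. The pieces and IDs are made up for this example; real tokenizers (BPE, WordPiece, etc.) learn their vocabularies from data and use more sophisticated merge rules.

```python
# Toy vocabulary: subword piece -> token ID (IDs invented for illustration)
VOCAB = {"un": 417, "believ": 9821, "able": 211}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest known piece at each position, then map to IDs."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids

print(tokenize("unbelievable"))   # [417, 9821, 211]
```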
Each token ID from the previous step (tokenization) is mapped to a dense vector using an embedding matrix.
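Concretely, this step is just a table lookup: an embedding matrix of shape (vocab_size, d_model) indexed by token ID. The sizes below are arbitrary placeholders for illustration.

```python
import torch
from torch import nn

vocab_size, d_model = 50_000, 64                # placeholder sizes
embedding = nn.Embedding(vocab_size, d_model)   # the learnable embedding matrix

token_ids = torch.tensor([[417, 9821, 211]])    # IDs from the tokenization example
vectors = embedding(token_ids)                  # one dense vector per token
print(vectors.shape)                            # torch.Size([1, 3, 64])
```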