If you've touched anything in modern AI (ChatGPT, Claude, image generators with text encoders), you've used a transformer. It's the architecture sitting underneath nearly all of it. Here's the short version of what that actually means.
Definition
In deep learning, a transformer is a family of neural network architectures built around the multi-head attention mechanism [Source 1]. That's the core idea. Attention is the engine; everything else is plumbing around it.
How it processes text
The pipeline is pretty mechanical once you see it laid out:
1. Tokenization. Text gets converted into numerical representations called tokens [Source 1].
2. Embedding. Each token is turned into a vector by looking it up in a word embedding table [Source 1]. (Steps 1 and 2 are sketched in code right after this list.)
3. Contextualization. At each layer, every token gets contextualized against the other (unmasked) tokens inside the context window, in parallel, using multi-head attention [Source 1].
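To make steps 1 and 2 concrete, here's a minimal sketch. Everything in it is made up for illustration: a three-word vocabulary standing in for a real subword tokenizer, and a tiny random table standing in for a learned embedding matrix with tens of thousands of rows.

```python
import numpy as np

# Step 1: tokenization. A toy vocabulary; real tokenizers split on subwords.
vocab = {"the": 0, "cat": 1, "sat": 2}
token_ids = [vocab[w] for w in "the cat sat".split()]   # [0, 1, 2]

# Step 2: embedding. A lookup, not a computation: row i is token i's vector.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))      # 3 tokens, 4 dims each

vectors = embedding_table[token_ids]
print(vectors.shape)    # (3, 4): one vector per token
```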
The useful intuition for step 3: attention amplifies the signal from tokens that matter for the current token and damps down the ones that don't [Source 1]. So when the model is reading the word "it" in a sentence, attention is what lets it figure out which earlier noun "it" refers to.
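Here is that intuition as code. It's a deliberately stripped-down sketch: it drops the learned query/key/value projections (see "Where to go next" below) and uses random vectors in place of real embeddings, but it keeps the core move of attention: score every pair of tokens, softmax the scores, and take a weighted average.

```python
import numpy as np

def toy_attention(x):
    """Self-attention minus the learned projections: each token scores every
    token, the scores go through a softmax, and the vectors are blended."""
    scores = x @ x.T / np.sqrt(x.shape[-1])            # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: rows sum to 1
    # High-weight tokens are amplified in the blend; low-weight ones are damped.
    return weights @ x

rng = np.random.default_rng(0)
contextualized = toy_attention(rng.normal(size=(3, 4)))  # 3 tokens, 4 dims each
```

Multi-head attention runs several of these weighted blends in parallel, each head with its own learned projections, and combines the results.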
Key terms
Token. The numerical unit a transformer actually consumes. Text in, tokens out, then vectors [Source 1].
Word embedding table. A lookup that maps each token to a vector [Source 1].
Context window. The span of tokens the model can attend to at once. Attention only operates over tokens inside this window [Source 1].
Multi-head attention. The parallel mechanism that lets each token weigh its relationship to every other unmasked token in the window [Source 1].
Masking. Some tokens are hidden from attention (for example, future tokens during autoregressive generation), which is why Source 1 specifies "unmasked" tokens [Source 1]. A sketch of a causal mask follows this list.
Layer. Transformers stack many of these attention blocks. Contextualization happens at each one [Source 1].
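The causal mask used during autoregressive generation is easy to sketch: before the softmax, set every future position's score to negative infinity so its attention weight comes out as exactly zero. The all-zero scores below are a stand-in, there only to show the mask's effect.

```python
import numpy as np

# Causal mask over a 4-token window: token i may attend to positions 0..i only.
T = 4
scores = np.zeros((T, T))                            # stand-in attention scores
future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
scores[future] = -np.inf                             # hide the future ...

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # ... lower-triangular: future tokens get weight 0
```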
Why it caught on
Two properties do most of the work. Attention is computed in parallel across the context window, which fits modern GPUs well [Source 1]. And the mechanism of amplifying important tokens while diminishing the rest gives the model a flexible way to route information without a fixed sequential bottleneck [Source 1].
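The parallelism is visible in the toy attention sketch above: the pairwise scores for the entire window fall out of a single matrix multiply, which is exactly the operation GPUs are built to chew through. A recurrent network, by contrast, has to walk the sequence one step at a time. Scaled up (random stand-in vectors again):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 64))    # a 1024-token window, 64 dims per token
scores = x @ x.T / np.sqrt(64)     # every token vs. every token, in one matmul
print(scores.shape)                # (1024, 1024): no step-by-step dependency
```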
Where to go next
If you want to go deeper, the natural next stops are: how multi-head attention is computed (queries, keys, values), positional encoding (since attention itself is order-agnostic), and the encoder vs decoder vs decoder-only variants. Those build directly on the definition above, but they're outside the scope of this entry.