LLMs Are Math
“AI feels magical, until you realize it’s mostly linear algebra.”
I’ve learned that when people interact with LLMs, it can feel like intelligence: understanding, creativity, reasoning.
But under the hood?
It’s math!
Not magic. Not consciousness. Not a digital brain.
Just math — and beautiful math at that.
1) Everything starts with vectors
LLMs don’t “understand” words the way humans do. They convert text into vectors — lists of numbers.

For example:
"king" -> [0.21, -0.84, 1.33, ..., 0.02]
"queen" -> [0.25, -0.79, 1.40, ..., 0.04]
Each token becomes a point in a high-dimensional space (often hundreds or thousands of dimensions). The wild part is that meaning becomes geometry. Relationships show up as vector arithmetic:
$$ \text{king} - \text{man} + \text{woman} \approx \text{queen} $$
That’s linear algebra working in semantic space.
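Here’s a minimal sketch of that arithmetic in NumPy, using made-up 4-dimensional vectors (real embeddings are learned and have hundreds or thousands of dimensions, so the analogy only holds approximately):

```python
import numpy as np

# Made-up 4-D "embeddings", purely for illustration. Real models learn
# vectors with hundreds or thousands of dimensions.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "man":   np.array([0.8, 0.1, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
    "queen": np.array([0.2, 0.8, 0.9, 0.3]),
}

# king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for word, vec in emb.items():
    print(f"{word:>5}: {cosine(target, vec):.3f}")
# "queen" scores highest: the relationship really is vector arithmetic.
```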
2) Matrices are the real workhorses
If vectors are points, matrices are transformations.
A useful intuition: a matrix transforms a grid. It can stretch, rotate, shear, or squash the whole space of points it acts on.

A neural network layer is often described as:
$$ y = xW + b $$
Where:
- $x$ is an input vector
- $W$ is a weight matrix (millions or billions of learned numbers)
- $b$ is a bias vector
- $y$ is the transformed output
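Here’s that one equation as runnable code, with toy dimensions picked purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# One layer with toy sizes: 4 inputs -> 3 outputs.
x = rng.normal(size=(4,))      # input vector
W = rng.normal(size=(4, 3))    # weight matrix: the learned numbers
b = rng.normal(size=(3,))      # bias vector

y = x @ W + b                  # the entire layer: multiply, then add
print(y.shape)                 # (3,)
```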
When people say:
“This model has 70 billion parameters.”
They mean:
“There are 70 billion numbers in matrices (and vectors) inside the model.”
Training is “just” learning those numbers.
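A toy sketch makes the counting concrete (the shapes here are invented; real models stack many more, far larger matrices):

```python
import numpy as np

# Two tiny layers; a "parameter" is just one entry in some W or b.
layers = [
    (np.zeros((4, 3)), np.zeros(3)),  # layer 1: W, b
    (np.zeros((3, 2)), np.zeros(2)),  # layer 2: W, b
]
n_params = sum(W.size + b.size for W, b in layers)
print(n_params)  # 4*3 + 3 + 3*2 + 2 = 23

# A "70B" model is exactly this count, just with enormous matrices.
```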
3) Attention is still just math (dot-products + softmax)
Modern LLMs are based on the Transformer architecture (introduced in 2017). The key idea is attention: the model computes how strongly each token should relate to every other token.

The core formula (scaled dot-product attention) is:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V $$
What that means in plain English:
- compute similarities via dot-products ($QK^T$)
- divide by $\sqrt{d}$ so the dot-products don’t blow up as dimensions grow
- normalize into probabilities with softmax
- mix the values ($V$) using those probabilities
It’s matrix multiplication, normalization, and more multiplication.
Math all the way down.
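Here’s the whole formula as a short NumPy sketch (toy sizes, random inputs; real models add learned projections, masking, and multiple heads):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # dot-product similarities, scaled
    weights = softmax(scores)      # each row becomes a probability dist.
    return weights @ V             # mix the values with those weights

rng = np.random.default_rng(0)
n_tokens, d = 5, 8                 # toy sizes
Q, K, V = (rng.normal(size=(n_tokens, d)) for _ in range(3))

out = attention(Q, K, V)
print(out.shape)  # (5, 8): one mixed value vector per token
```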
4) LLMs predict the next token (probability, not certainty)
At its core, an LLM is a probability machine.
Given a prompt like:
“The sky is”
it produces a probability distribution over the next token.

The final layer uses softmax to convert scores (“logits”) into probabilities:
$$ p_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$
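In code, with made-up logits for a few candidate tokens (a real vocabulary has tens of thousands of entries):

```python
import numpy as np

# Invented logits for possible continuations of "The sky is".
tokens = ["blue", "clear", "falling", "banana"]
logits = np.array([3.2, 2.1, 0.4, -1.5])

probs = np.exp(logits) / np.exp(logits).sum()  # the softmax formula above
for token, p in zip(tokens, probs):
    print(f"{token:>8}: {p:.3f}")
# "blue" gets most of the probability mass, but nothing is certain.
```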
Training minimizes cross-entropy loss — a way to measure how wrong the predicted distribution is compared to the true next token:
$$ \mathcal{L} = -\sum_i y_i \log(p_i) $$
Then optimization (gradient descent) adjusts parameters to reduce that loss.
Which is… calculus and optimization.
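Here’s that training step in miniature, with invented logits and a 4-token vocabulary. One handy fact (true for softmax plus cross-entropy): the gradient with respect to the logits is simply the predicted probabilities minus the one-hot target. For simplicity the sketch treats the logits themselves as the adjustable parameters; a real model backpropagates this gradient through all its layers.

```python
import numpy as np

logits = np.array([1.0, 0.5, -0.2, 0.1])  # made-up scores
target = 0                                 # the "true" next token's index

probs = np.exp(logits) / np.exp(logits).sum()
loss = -np.log(probs[target])              # cross-entropy, one-hot target

grad = probs.copy()                        # gradient of loss w.r.t. logits
grad[target] -= 1.0                        # ... is probs minus one-hot

logits -= 0.5 * grad                       # one gradient-descent step

new_probs = np.exp(logits) / np.exp(logits).sum()
print(loss, -np.log(new_probs[target]))    # loss drops: ~0.84 -> ~0.64
```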
5) So where does “intelligence” come from?
There is no single place in the model that contains:
- grammar rules
- facts about Spain
- knowledge about GPUs
- a hard-coded reasoning engine
Instead, those behaviors emerge from:
- linear algebra (vectors + matrices)
- non-linear functions
- probability distributions
- gradient-based optimization
- scale (lots of data + lots of parameters)
That’s the surprising part:
Not that it “thinks” like us — but that math at scale can produce behavior that feels like thinking.
6) Why this matters if you’re learning AI
It’s easy to feel overwhelmed by buzzwords:
- Transformers
- RLHF
- fine-tuning
- agents
- multimodal models
But the foundation is compact:
- Linear algebra
- Probability
- Calculus
- Optimization
If you understand:
- what vectors represent
- what matrix multiplication does
- what a derivative tells you
- what a probability distribution means
you understand a huge chunk of modern AI.
The rest is mostly engineering choices and scale.
7) LLMs are math
There’s something empowering about this:
- AI isn’t mystical.
- It isn’t unreachable.
- It isn’t reserved for a “priesthood”.
It’s math.
And math is learnable.
The next time you see a model produce a surprisingly elegant answer, remember:
Behind those words is a giant pile of matrices multiplying vectors at insane speed.
And somehow… that’s enough.