A 34-minute working session on what neural networks actually are, how they learn, why they overfit, the three architecture families worth knowing, and the tooling you'd reach for today — with honest notes on when plain machine learning still beats a deep net.
This deck assumes the basics from Machine Learning Fundamentals — features, labels, training vs. test, over/underfitting. Deep learning is one branch of ML: it stacks many simple layers so the model discovers its own features from raw data, instead of you engineering them by hand.
Classical ML leans on a human to engineer features; a deep network learns its own features straight from the raw input.
The honest default: try the simple model first. A deep net is rarely the cheapest path to a working baseline.
The whole edifice is built from one tiny operation repeated millions of times: multiply each input by a weight, add them up with a bias, then pass the result through a non-linear function. Stack these into layers and you get a network.
y = activation(w·x + b). The weights w say how much each input matters, the bias b shifts the result, and the activation adds the non-linearity that lets stacked layers learn curves, not just straight lines.
Inputs are scaled by weights, summed with a bias, then squashed by an activation: that one unit, repeated, is the whole network.
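A minimal sketch of that single unit in plain Python. The input values, weights, and the choice of sigmoid as the activation are illustrative assumptions, not values from the deck.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute activation(w . x + b) for one unit."""
    # Weighted sum: each input scaled by its weight, then the bias is added.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only; real weights are learned during training.
x = [0.5, -1.0, 2.0]
w = [0.8, 0.2, -0.5]
b = 0.1

y = neuron(x, w, b)
print(y)  # a value strictly between 0 and 1
```

A layer is just many of these units sharing the same inputs, each with its own weights and bias.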
A feed-forward network: every unit in one layer feeds every unit in the next. Each connection has its own weight to learn.
Without an activation, stacking layers collapses into a single linear step — no matter how deep. The activation is what lets a network bend.
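The collapse is easy to check numerically. The sketch below uses small hand-picked weight matrices (an assumption for illustration) and shows that two stacked linear layers give exactly the same output as one merged linear layer, while putting a ReLU between them does not.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

# Illustrative weights, chosen by hand (not learned).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden units
b1 = [0.5, -0.5, 1.0]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # 3 hidden units -> 2 outputs
b2 = [0.0, 1.0]

def two_layers(x, activation=None):
    hidden = add(matvec(W1, x), b1)
    if activation is not None:
        hidden = activation(hidden)
    return add(matvec(W2, hidden), b2)

# Merge the two linear layers: W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

x = [2.0, -3.0]
print(two_layers(x))        # same as one_layer(x)
print(one_layer(x))
print(two_layers(x, relu))  # differs once an activation sits in between
```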
ReLU: max(0, z). Cheap to compute and fast to train, so it is the standard choice for hidden layers. Variants (LeakyReLU, GELU) address edge cases such as units that get stuck outputting zero.
Sigmoid: maps any number into (0, 1). Used for a single yes/no output. It saturates at both ends, where gradients shrink toward zero, so avoid it in the hidden layers of a deep network.
Tanh: zero-centered cousin of sigmoid, with outputs in (-1, 1). Common inside older recurrent networks; mostly displaced by ReLU in feed-forward layers.
Softmax: turns a vector of scores into a probability distribution that sums to 1. The standard final layer for multi-class classification.
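The four functions described above (commonly called ReLU, sigmoid, tanh, and softmax) fit in a few lines. This is a plain-Python sketch for single numbers and short lists; frameworks provide vectorized versions, and the branch in sigmoid and the max-subtraction in softmax are standard numerical-stability tricks added here, not something the deck specifies.

```python
import math

def relu(z):
    """max(0, z): passes positives through, zeroes out negatives."""
    return max(0.0, z)

def sigmoid(z):
    """Map any real number into (0, 1); branch avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    """Zero-centered squashing into (-1, 1)."""
    return math.tanh(z)

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(relu(-2.0), relu(3.0))
print(sigmoid(0.0), tanh(0.0))
print(probs, sum(probs))
```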
Training is a loop. The network makes a prediction, a loss function scores how wrong it was, backpropagation works out how each weight contributed to that error, and gradient descent nudges every weight a little in the direction that reduces it. Repeat for many passes over the data.
One training step: predict, score the error, propagate it back to every weight, then update. The loop runs thousands of times.
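Here is the loop in miniature: a one-weight, one-bias model fitted to toy data. The data (points on y = 2x + 1), the mean-squared-error loss, and the learning rate are all assumptions for illustration; with a single linear unit, "backpropagation" reduces to two hand-written derivatives.

```python
# Toy data from y = 2x + 1 (assumed for illustration).
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

w, b = 0.0, 0.0          # start from arbitrary weights
learning_rate = 0.1
n = len(data)

for epoch in range(200):              # repeat for many passes over the data
    loss = grad_w = grad_b = 0.0
    for x, target in data:
        pred = w * x + b              # 1. predict
        error = pred - target
        loss += error ** 2 / n        # 2. score the error (mean squared error)
        grad_w += 2 * error * x / n   # 3. propagate back: d(loss)/dw
        grad_b += 2 * error / n       #    and d(loss)/db
    w -= learning_rate * grad_w       # 4. update: step against the gradient
    b -= learning_rate * grad_b

print(w, b)  # approaches the true values 2 and 1
```

In a real network, step 3 is handled by the framework's automatic differentiation rather than written by hand.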
Each step rolls the weights a little further down the loss curve. Too big a learning rate overshoots; too small crawls.
Like hiking down into a valley in fog — you can't see the bottom, so you keep stepping in whatever direction goes downhill.
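To see the learning-rate trade-off concretely, the sketch below runs gradient descent on the simplest possible valley, f(w) = w², whose gradient is 2w. The function and the three learning rates are illustrative assumptions.

```python
def descend(learning_rate, steps=50, start=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2.0 * w                # slope of w**2 at the current point
        w -= learning_rate * grad     # step downhill
    return w

print(descend(0.1))     # well-chosen: lands very close to the minimum at 0
print(descend(0.001))   # too small: has barely moved from the start
print(descend(1.1))     # too big: each step overshoots further, and it diverges
```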
The goal is generalization — doing well on data the model has never seen. Big networks can simply memorize the training set, so most of the craft of deep learning is the set of tricks that stop them from doing that.
When validation loss turns upward while training loss keeps falling, the model has started memorizing. Stop at the dashed line.
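The usual way to automate "stop when validation loss turns upward" is early stopping with a patience window. The sketch below is one simple version, assuming you already have a list of per-epoch validation losses; the loss values and the patience of 3 are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs in a row fail to beat the best
    validation loss seen so far; you would then keep the best epoch's weights.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation curve: improves, bottoms out, then turns upward.
val_losses = [0.90, 0.70, 0.55, 0.48, 0.45, 0.46, 0.49, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses)
print(stop_epoch, best_epoch)
```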
Learning rate: step size for each update. Too high and training diverges or oscillates; too low and it barely moves. A schedule that decays it over time usually helps. Tune this first.
Batch size: larger batches train faster per epoch and give smoother gradients; smaller batches add helpful noise and use less memory. It interacts with learning rate, so change them together.
Dropout rate: fraction of units disabled per step, often 0.1–0.5. Higher fights overfitting harder but can underfit. Off at inference time, when the full network is used to predict.
These are hyperparameters — settings you choose, not values the model learns. Searching them well is most of the day-to-day work.
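Of these settings, dropout is the one whose mechanics are least obvious, so here is a sketch. It uses the common "inverted dropout" convention, where surviving activations are scaled up by 1 / (1 - rate) during training so nothing needs rescaling at inference; that scaling is an assumption of this sketch, not something stated above.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Randomly zero a fraction `rate` of activations during training.

    At inference time (training=False) the input passes through unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Keep each unit with probability `keep`; rescale survivors so the
    # expected activation stays the same as without dropout.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

hidden = [1.0] * 1000
train_out = dropout(hidden, rate=0.5, training=True, rng=random.Random(0))
infer_out = dropout(hidden, rate=0.5, training=False)

print(sum(1 for v in train_out if v == 0.0) / len(train_out))  # roughly 0.5
print(infer_out == hidden)
```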
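A plain feed-forward net has no built-in notion of structure: each input value is just another number, with no sense of which ones are neighbors or which came first. Real data has structure (pixels near each other relate, words come in order), and these three families bake that structure into the network itself.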
CNNs (convolutional neural networks): a convolution slides a small filter (kernel) across the image, detecting the same local pattern, such as an edge or a texture, anywhere it appears. Pooling then shrinks the map, so stacked layers build up from edges to shapes to objects. Two big wins: far fewer weights than a dense layer, because the filter's weights are shared across positions, and approximate translation invariance (a cat is a cat wherever it sits).
A filter scans for local patterns; pooling condenses; layers stack edges into objects.
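A tiny worked example of filter-then-pool, in plain Python. The 5x5 "image" with a vertical edge and the 2x2 edge-detecting kernel are hand-made assumptions; as in most deep learning libraries, the "convolution" here does not flip the kernel (strictly speaking it is a cross-correlation).

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool2d(fmap, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(fmap[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

# Dark on the left, bright on the right: a vertical edge after column 1.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Responds where brightness increases from left to right.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)   # 4x4 map, non-zero only at the edge
pooled = max_pool2d(feature_map)      # 2x2 summary of the feature map

print(feature_map)
print(pooled)
```

The same four kernel weights are reused at every position, which is where the weight savings over a dense layer come from.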
An RNN reads a sequence one step at a time, carrying a hidden state — a running memory — from step to step. Plain RNNs forget long-range context (the vanishing gradient problem), so LSTM and GRU add gates that decide what to keep and what to discard, holding information across longer spans.
The same cell runs at each step, passing its hidden state forward as memory of what came before.
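The recurrence can be sketched with a single scalar hidden state. Real RNNs use vectors and weight matrices, and LSTM or GRU cells add gates on top; the weights below are fixed, illustrative assumptions rather than learned values.

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One step of a plain RNN cell: new state from current input and old state."""
    return math.tanh(w_x * x + w_h * h + b)

def run_rnn(sequence, w_x=0.5, w_h=0.9, b=0.0):
    """Apply the same cell at every step, carrying the hidden state forward."""
    h = 0.0                  # empty memory before the first input
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

states = run_rnn([1.0, 0.0, 0.0])
print(states)                       # the first input still echoes in later states
print(run_rnn([1.0, 0.0])[-1])      # same inputs, different order ...
print(run_rnn([0.0, 1.0])[-1])      # ... give a different final state
```

Notice how the echo of the first input shrinks at every step: that fading is a small-scale picture of why plain RNNs lose long-range context.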
Instead of stepping through a sequence, a transformer uses self-attention: every token looks at every other token at once and weighs which ones matter for its meaning. Because there's no recurrence, the whole sequence trains in parallel — which is what made training on internet-scale data practical.
Each token (here "cat") weighs every other token to build its context — all in parallel.
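A stripped-down sketch of self-attention, using scaled dot-product scores. To keep it short, each token's vector serves directly as its query, key, and value; real transformers first pass tokens through learned projection matrices, so treat that shortcut and the toy 2-dimensional token vectors as assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Return (outputs, weights) for a list of equal-length token vectors."""
    d = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Score this token against every token (itself included).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in tokens
        ]
        weights = softmax(scores)          # how much each token matters
        # New representation: weighted average of all token vectors.
        context = [
            sum(w * value[j] for w, value in zip(weights, tokens))
            for j in range(d)
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token vectors
outputs, weights = self_attention(tokens)
print(weights[0])   # attention paid by the first token to each token
```

The loop over queries is only for readability. Each row is independent of the others, which is why frameworks compute all of them at once as a matrix multiplication.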
You'll rarely build these from scratch. Pick the family that fits your data, then start from a pretrained model — the next section.
The math is one thing; getting a model trained and shipped is another. Three realities shape every project — and one shortcut makes deep learning practical for normal teams.
Training is mostly large matrix multiplications, and a GPU runs the thousands of multiply-adds inside each one in parallel, often orders of magnitude faster than a CPU. Google's TPUs are a purpose-built alternative. Mixed precision (doing most of the math in 16-bit floats instead of 32-bit) speeds things up and saves memory.
Deep nets need a lot of labeled examples, and quality matters more than cleverness. Most real projects spend more effort on collecting, cleaning, and labeling data than on the model itself. Garbage in, garbage out.
Transfer learning: start from a model already trained on huge data, then adapt it to your task with a fraction of the data and compute. This is the default workflow; training from scratch is the exception.
Keep the pretrained backbone, attach a small task-specific head, and train just that (or lightly fine-tune the rest).
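The backbone-plus-head idea above, reduced to a toy. Everything here is a stand-in: the "pretrained backbone" is a fixed two-feature function with hand-picked weights, the head is a logistic-regression unit, and the six labeled examples are invented. What carries over to real projects is the shape of the workflow: the backbone's weights are never touched, and only the small head is trained.

```python
import math

# Stand-in for a pretrained backbone: fixed weights that are never updated.
# Hypothetical values; a real backbone would be a large downloaded network.
BACKBONE_WEIGHTS = ((1.0, 1.0), (1.0, -1.0))

def backbone(x):
    """Frozen feature extractor: raw input -> features."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
            for row in BACKBONE_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, head_w, head_b):
    """Probability of class 1: frozen features fed into the trainable head."""
    feats = backbone(x)
    return sigmoid(sum(w * f for w, f in zip(head_w, feats)) + head_b)

def train_head(examples, epochs=200, learning_rate=0.5):
    """Train only the head's weights and bias with gradient descent."""
    head_w, head_b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in examples:
            feats = backbone(x)                       # backbone stays frozen
            error = predict(x, head_w, head_b) - label
            head_w = [w - learning_rate * error * f
                      for w, f in zip(head_w, feats)]
            head_b -= learning_rate * error
    return head_w, head_b

# Small, invented task: label is 1 when the two inputs sum to more than zero.
examples = [
    ([1.0, 0.5], 1), ([0.2, 0.3], 1), ([2.0, -1.0], 1),
    ([-1.0, -0.5], 0), ([-0.3, -0.2], 0), ([-2.0, 1.0], 0),
]

head_w, head_b = train_head(examples)
print([round(predict(x, head_w, head_b), 3) for x, _ in examples])
```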
You define the network, the framework computes gradients automatically (autodiff) and runs it on the GPU. The differences are about ergonomics, deployment, and scale.
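To demystify autodiff, here is a toy reverse-mode version for scalars that supports only addition and multiplication. The `Value` class is a hypothetical teaching device, not the API or internals of any of the frameworks below; it only shows the principle of recording operations on the way forward and applying the chain rule on the way back.

```python
class Value:
    """A number that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None   # pushes this node's grad to its parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad             # d(a + b)/da = 1
            other.grad += out.grad            # d(a + b)/db = 1
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad   # d(a * b)/da = b
            other.grad += self.data * out.grad   # d(a * b)/db = a
        out._backward_fn = backward_fn
        return out

    def backward(self):
        """Apply the chain rule from this node back to every input."""
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):          # outputs first, inputs last
            if node._backward_fn is not None:
                node._backward_fn()

w, x, b = Value(3.0), Value(2.0), Value(1.0)
y = w * x + b          # forward pass builds the graph
y.backward()           # backward pass fills in .grad everywhere
print(y.data, w.grad, x.grad, b.grad)
```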
PyTorch: Pythonic, eager-by-default (you debug it like normal code), and the dominant choice in research with the largest model ecosystem. torch.compile closes much of the old speed gap.
Pro — easiest to learn and debug; huge community and pretrained models.
Con — production serving and mobile need extra tooling.
Choose when you're learning, experimenting, or doing most research and applied work — the safe default.
Keras is the high-level API anyone can read; Keras 3 even runs on PyTorch or JAX backends. TensorFlow brings mature serving (TF Serving) and on-device deployment (LiteRT, formerly TF Lite).
Pro — battle-tested deployment path to servers, web, and mobile.
Con — smaller research mindshare today; more API surface area.
Choose when you need a clean high-level API (Keras) or a proven road to production and edge devices.
JAX: a functional, NumPy-like core with composable transforms (grad, jit, vmap), compiled via XLA. Paired with libraries like Flax and Optax; a favorite for large-scale training, especially on TPUs.
Pro — top-tier speed and clean scaling across many accelerators.
Con — steepest learning curve; functional style and more boilerplate.
Choose when you're pushing performance or training big models on TPUs and want fine control.
Don't train from scratch by default. Between pretrained models on the Hugging Face Hub and hosted APIs, you can often solve the problem without owning a training pipeline at all. Build the custom model only when off-the-shelf genuinely can't do the job.