Deep Learning & Neural Networks

01 · From classical ML to deep learning 4 min

Deep learning trades hand-crafted features for learned ones.

This deck assumes the basics from Machine Learning Fundamentals — features, labels, training vs. test, over/underfitting. Deep learning is one branch of ML: it stacks many simple layers so the model discovers its own features from raw data, instead of you engineering them by hand.

Deep learning — machine learning with neural networks that are "deep" — meaning many stacked layers. Each layer transforms the previous one's output, so early layers learn simple patterns (edges, word fragments) and later layers combine them into complex ones (faces, meaning). "Deep" just refers to that depth of layers.

The core shift

Classical ML — a human picks the features (column shapes, word counts, ratios), then a model like a decision tree or logistic regression learns from them.
Deep learning — you feed in raw pixels, audio, or text and the network learns the features itself as part of training.
That's the trade: less manual feature work, but far more data and compute, and a model that's harder to interpret.

Classical ML leans on a human to engineer features; a deep network learns its own features straight from the raw input.

When each one wins — honestly

Reach for classical ML when…

You have tabular data (rows and columns). Gradient-boosted trees — XGBoost, LightGBM, CatBoost — usually win here, train in minutes, and run on a laptop.
Data is small (hundreds to low thousands of rows). Deep nets are data-hungry and overfit small sets.
You need explainability, fast iteration, or cheap inference.

Reach for deep learning when…

Data is unstructured — images, audio, video, or natural language — where features are hard to hand-design.
You have lots of data (or a pretrained model to build on) and access to GPUs.
The pattern is complex and non-linear, and a few extra points of accuracy justify the cost.

The honest default: try the simple model first. A deep net is rarely the cheapest path to a working baseline.

02 · The neuron & the network 5 min

A neuron is just weighted sum → activation.

The whole edifice is built from one tiny operation repeated millions of times: multiply each input by a weight, add them up with a bias, then pass the result through a non-linear function. Stack these into layers and you get a network.

Neuron (unit) — a function that computes y = activation(w·x + b). The weights w say how much each input matters, the bias b shifts the result, and the activation adds the non-linearity that lets stacked layers learn curves, not just straight lines.

// one neuron, by hand
const relu = (z) => Math.max(0, z)   // activation: max(0, z)

function neuron(x, w, b) {
  let z = b                                              // start from the bias
  for (let i = 0; i < x.length; i++) z += x[i] * w[i]    // weighted sum
  return relu(z)                                         // apply the activation
}
// a layer = many neurons; a network = stacked layers

Inputs are scaled by weights, summed with a bias, then squashed by an activation — that one unit, repeated, is the whole network.

A feed-forward network: every unit in one layer feeds every unit in the next. Each connection has its own weight to learn.

Reading the network

Input layer — your raw numbers (pixels, token embeddings, sensor readings).
Hidden layers — where features get built; more layers / wider layers = more capacity (and more risk of overfitting).
Output layer — the answer: one number for regression, or a probability per class for classification.
Parameters = all the weights and biases. Modern models have millions to billions of them.
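
To make the layer picture concrete, here is a minimal sketch of a forward pass through a two-layer network, plus a parameter count. The helper names (denseLayer, forward, countParameters) and all the weights are made up for illustration; a real framework does the same arithmetic with matrix operations.

```javascript
// Hypothetical helpers; the weights below are arbitrary illustration values.
const relu = (z) => Math.max(0, z);
const identity = (z) => z;

// One dense layer: every output unit takes a weighted sum of ALL inputs.
// weights[j] holds the weights of output unit j; biases[j] is its bias.
function denseLayer(x, weights, biases, activation) {
  return weights.map((w, j) => {
    let z = biases[j];
    for (let i = 0; i < x.length; i++) z += w[i] * x[i];
    return activation(z);
  });
}

// A network is a list of layers applied in order (the forward pass).
function forward(x, layers) {
  return layers.reduce(
    (out, layer) => denseLayer(out, layer.weights, layer.biases, layer.activation),
    x
  );
}

// Parameters = every weight plus every bias.
function countParameters(layers) {
  return layers.reduce(
    (n, layer) => n + layer.weights.length * layer.weights[0].length + layer.biases.length,
    0
  );
}

const network = [
  // hidden layer: 2 inputs -> 3 units
  { weights: [[1, -1], [0.5, 0.5], [-1, 2]], biases: [0, 0, 1], activation: relu },
  // output layer: 3 units -> 1 number (a regression-style output)
  { weights: [[1, 2, -1]], biases: [0.5], activation: identity },
];

const prediction = forward([2, 1], network);     // [3.5]
const parameterCount = countParameters(network); // (2*3 + 3) + (3*1 + 1) = 13
```

Even this toy has 13 parameters; widening or deepening the layers is what pushes real models into the millions.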

Activation functions — the non-linearity

Without an activation, stacking layers collapses into a single linear step — no matter how deep. The activation is what lets a network bend.
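
The collapse is easy to check by hand. The sketch below uses two scalar "layers" with arbitrary weights: without an activation they reduce to one linear step, and inserting a ReLU between them breaks that shortcut.

```javascript
// Two layers with no activation, scalar case. The numbers are arbitrary.
const w1 = 2, b1 = 1;
const w2 = -3, b2 = 4;
const stackedLinear = (x) => w2 * (w1 * x + b1) + b2;

// The same function written as ONE linear step.
const wCombined = w2 * w1;      // -6
const bCombined = w2 * b1 + b2; //  1
const singleLinear = (x) => wCombined * x + bCombined;

// Put a ReLU between the layers and no single linear step matches any more.
const relu = (z) => Math.max(0, z);
const stackedWithRelu = (x) => w2 * relu(w1 * x + b1) + b2;

const inputs = [-2, -1, 0, 1, 2];
const linearOutputs = inputs.map(stackedLinear);  // [13, 7, 1, -5, -11]
const singleOutputs = inputs.map(singleLinear);   // [13, 7, 1, -5, -11]
const reluOutputs = inputs.map(stackedWithRelu);  // [4, 4, 1, -5, -11]
```

The ReLU version is flat for negative inputs and sloped for positive ones: a bend that no single straight line can reproduce.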

ReLU

The default

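max(0, z). Cheap, trains fast, the standard choice for hidden layers. Variants adjust its behavior for negative inputs: LeakyReLU keeps a small slope there so units can't get stuck at zero, and GELU is a smooth version common in transformers.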

Sigmoid

Squash to 0–1

Maps any number into (0, 1). Used for a single yes/no output. It saturates at both ends, where gradients shrink toward zero, so avoid it in the hidden layers of deep networks.

Tanh

Squash to −1–1

Zero-centered cousin of sigmoid. Common inside older recurrent networks; mostly displaced by ReLU in feed-forward layers.

Softmax

Probabilities

Turns a vector of scores into a probability distribution that sums to 1 — the standard final layer for multi-class classification.
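
For reference, all four activations fit in a few lines. This is a plain-JavaScript sketch rather than any framework's implementation; the max-subtraction in softmax is a common trick to avoid overflow and does not change the result.

```javascript
const relu = (z) => Math.max(0, z);
const sigmoid = (z) => 1 / (1 + Math.exp(-z));
const tanh = (z) => Math.tanh(z);

// Softmax works on a whole vector of scores, not one number at a time.
function softmax(scores) {
  const max = Math.max(...scores);                    // for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const probs = softmax([2.0, 1.0, 0.1]); // highest score gets the highest probability
const probSum = probs.reduce((a, b) => a + b, 0);
```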

03 · How networks learn 6 min

Guess, measure the error, nudge every weight downhill.

Training is a loop. The network makes a prediction, a loss function scores how wrong it was, backpropagation works out how each weight contributed to that error, and gradient descent nudges every weight a little in the direction that reduces it. Repeat for many passes over the data.

Loss function — a single number measuring how wrong the prediction is. Lower is better. Use cross-entropy for classification and mean squared error (MSE) for regression. Training is just the search for weights that make this number small.
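
Both losses are short enough to write out. The sketch below assumes predictions are plain arrays and, for cross-entropy, that the model already outputs probabilities (for example from a softmax); the function names and sample numbers are illustrative.

```javascript
// Mean squared error for regression: average of squared differences.
function meanSquaredError(predictions, targets) {
  let sum = 0;
  for (let i = 0; i < predictions.length; i++) {
    const diff = predictions[i] - targets[i];
    sum += diff * diff;
  }
  return sum / predictions.length;
}

// Cross-entropy for ONE classification example.
// probs: predicted probability per class; trueClass: index of the correct class.
// epsilon guards against log(0).
function crossEntropy(probs, trueClass, epsilon = 1e-12) {
  return -Math.log(Math.max(probs[trueClass], epsilon));
}

const mse = meanSquaredError([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]); // (0.25 + 0.25 + 0) / 3
const confidentRight = crossEntropy([0.9, 0.05, 0.05], 0);       // small loss
const confidentWrong = crossEntropy([0.05, 0.9, 0.05], 0);       // large loss
```

Cross-entropy punishes confident wrong answers far more than hesitant ones, which is why it suits classification.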

One training step: predict, score the error, propagate it back to every weight, then update. The loop runs thousands of times.

The vocabulary

Forward pass — run inputs through the network to get a prediction.
Backpropagation — the chain rule applied backwards through the layers to compute the gradient: how much each weight affected the loss.
Gradient descent — step each weight a little against its gradient. The step size is the learning rate.
Optimizer — the rule that turns gradients into updates. Adam / AdamW are the common defaults; plain SGD still wins in some settings.
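
Backpropagation can be traced by hand on a tiny network. The sketch below assumes a two-weight scalar model (z = w1·x, h = relu(z), y = w2·h) with a squared-error loss, and checks the chain-rule gradients against a finite-difference estimate. All names and numbers are illustrative.

```javascript
// Tiny network: z = w1 * x, h = relu(z), y = w2 * h, loss = (y - target)^2
function forwardAndBackward(x, target, w1, w2) {
  // Forward pass: keep the intermediate values, the backward pass needs them.
  const z = w1 * x;
  const h = Math.max(0, z);
  const y = w2 * h;
  const loss = (y - target) ** 2;

  // Backward pass: chain rule, from the loss back toward the weights.
  const dLoss_dy = 2 * (y - target);
  const dLoss_dw2 = dLoss_dy * h;
  const dLoss_dh = dLoss_dy * w2;
  const dLoss_dz = z > 0 ? dLoss_dh : 0; // ReLU passes gradient only where z > 0
  const dLoss_dw1 = dLoss_dz * x;

  return { loss, dLoss_dw1, dLoss_dw2 };
}

// Sanity check: central finite difference of the loss with respect to one weight.
function numericalGradient(f, w, eps = 1e-6) {
  return (f(w + eps) - f(w - eps)) / (2 * eps);
}

const x = 2, target = 1, w1 = 0.5, w2 = 3;
const result = forwardAndBackward(x, target, w1, w2);
// result.loss = 4, result.dLoss_dw1 = 24, result.dLoss_dw2 = 4
const numeric_dw1 = numericalGradient((w) => forwardAndBackward(x, target, w, w2).loss, w1);
const numeric_dw2 = numericalGradient((w) => forwardAndBackward(x, target, w1, w).loss, w2);
```

Frameworks automate exactly this bookkeeping for every weight in the network.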

Each step rolls the weights a little further down the loss curve. Too big a learning rate overshoots; too small crawls.
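
The effect of the learning rate shows up even on a one-weight problem. This sketch minimizes a made-up loss, (w − 3)², whose best weight is 3, using three different step sizes.

```javascript
// Toy loss with its minimum at w = 3, and its gradient.
const loss = (w) => (w - 3) ** 2;
const gradient = (w) => 2 * (w - 3);

function gradientDescent(startW, learningRate, steps) {
  let w = startW;
  for (let step = 0; step < steps; step++) {
    w -= learningRate * gradient(w); // step against the gradient
  }
  return w;
}

const justRight = gradientDescent(0, 0.1, 50);  // ends very close to 3
const tooSmall = gradientDescent(0, 0.001, 50); // has barely left the start
const tooBig = gradientDescent(0, 1.1, 50);     // overshoots further each step and diverges
```

With a rate of 1.1 each update jumps past the minimum and lands farther away than it started, so the error grows instead of shrinking.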

Batches and epochs

Batch — a small group of examples processed together before one weight update. We rarely use the whole dataset at once (that's too slow and memory-heavy).
Epoch — one full pass over the entire training set. Training takes many epochs.
Step / iteration — one batch processed = one update.
Rule of thumb: watch the validation loss, not just training loss. When it stops improving, you're done — more epochs just overfit (next section).
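
How batches, steps, and epochs relate can be sketched as a loop skeleton. Here trainStep is a hypothetical placeholder for "forward pass, loss, backprop, update"; real training loops usually also shuffle the examples each epoch, which is omitted for brevity.

```javascript
// Split a dataset into mini-batches; the last batch may be smaller.
function makeBatches(examples, batchSize) {
  const batches = [];
  for (let i = 0; i < examples.length; i += batchSize) {
    batches.push(examples.slice(i, i + batchSize));
  }
  return batches;
}

// trainStep stands in for one real update on one batch.
function train(examples, batchSize, epochs, trainStep) {
  let steps = 0;
  for (let epoch = 0; epoch < epochs; epoch++) {          // one epoch = one full pass
    for (const batch of makeBatches(examples, batchSize)) {
      trainStep(batch);
      steps++;                                            // one batch = one update
    }
  }
  return steps;
}

const dataset = Array.from({ length: 10 }, (_, i) => i); // 10 dummy examples
const batchSizesSeen = [];
const totalSteps = train(dataset, 4, 3, (batch) => batchSizesSeen.push(batch.length));
// 10 examples in batches of 4 -> 3 steps per epoch (sizes 4, 4, 2) -> 9 steps over 3 epochs
```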

Like hiking down into a valley in fog — you can't see the bottom, so you keep stepping in whatever direction goes downhill.

04 · Keeping it honest 5 min

A model that aces training and fails in the wild has overfit.

The goal is generalization — doing well on data the model has never seen. Big networks can simply memorize the training set, so most of the craft of deep learning is the set of tricks that stop them from doing that.

Overfitting — the model learns the training data's noise, not its signal. You spot it when training loss keeps dropping but validation loss starts rising. The opposite, underfitting, is a model too simple to capture the pattern at all.

When validation loss turns upward while training loss keeps falling, the model has started memorizing. Stop at the dashed line.

How to fight it

More / better data — the most reliable fix. Often via data augmentation (flips, crops, noise) to multiply what you have.
Early stopping — halt training when validation loss stops improving.
Regularization (weight decay / L2) — penalize large weights so the model stays simpler.
Dropout — randomly switch off a fraction of units each step so the network can't lean on any one path.
Simplify — fewer layers/units. The smallest model that works generalizes best.
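
Two of these fixes are small enough to sketch directly. The early-stopping rule below uses a "patience" counter, one common way to decide that validation loss has stopped improving; the weight-decay step folds an L2 penalty into a plain SGD update. Function names and the validation curve are made up for illustration.

```javascript
// Early stopping: remember the best validation loss and stop once it has
// failed to improve for `patience` epochs in a row.
function earlyStoppingEpoch(validationLosses, patience) {
  let bestLoss = Infinity;
  let bestEpoch = -1;
  let epochsWithoutImprovement = 0;

  for (let epoch = 0; epoch < validationLosses.length; epoch++) {
    if (validationLosses[epoch] < bestLoss) {
      bestLoss = validationLosses[epoch];
      bestEpoch = epoch;
      epochsWithoutImprovement = 0;
    } else {
      epochsWithoutImprovement++;
      if (epochsWithoutImprovement >= patience) break; // give up, keep the best epoch
    }
  }
  return { bestEpoch, bestLoss };
}

// Weight decay (L2): each update also shrinks every weight toward zero.
function sgdStepWithWeightDecay(weights, gradients, learningRate, weightDecay) {
  return weights.map((w, i) => w - learningRate * (gradients[i] + weightDecay * w));
}

// Made-up validation curve: improves, then turns upward.
const validationLosses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61];
const stop = earlyStoppingEpoch(validationLosses, 2); // { bestEpoch: 3, bestLoss: 0.5 }

// With zero gradient, only the decay acts: each weight shrinks by 5%.
const decayed = sgdStepWithWeightDecay([1.0, -2.0], [0, 0], 0.1, 0.5);
```

In practice you would also save the weights from the best epoch and restore them after stopping.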

The knobs you'll actually turn

Learning rate

The most important knob

Step size for each update. Too high and training diverges or oscillates; too low and it barely moves. A schedule that decays it over time usually helps. Tune this first.
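
As one example of a schedule, exponential decay multiplies the learning rate by a fixed factor each epoch. The starting rate and decay factor below are arbitrary; cosine and step schedules are common alternatives.

```javascript
// Exponential decay: multiply the learning rate by decayRate once per epoch.
function decayedLearningRate(initialRate, decayRate, epoch) {
  return initialRate * Math.pow(decayRate, epoch);
}

// Halve the rate every epoch, starting from 0.1.
const schedule = [0, 1, 2, 3].map((epoch) => decayedLearningRate(0.1, 0.5, epoch));
// roughly [0.1, 0.05, 0.025, 0.0125]
```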

Batch size

Speed vs. noise

Larger batches train faster per epoch and give smoother gradients; smaller batches add helpful noise and use less memory. It interacts with learning rate — change them together.

Dropout rate

How much to drop

Fraction of units disabled per step, often 0.1–0.5. Higher fights overfitting harder but can underfit. Off at inference time — the full network is used to predict.
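
A minimal sketch of dropout, assuming the common "inverted" variant: surviving units are scaled up during training so that nothing needs to change at inference. The random source is injectable only to make the example repeatable, and the rate must be below 1.

```javascript
// Inverted dropout. `random` defaults to Math.random but can be swapped out.
function dropout(activations, rate, training, random = Math.random) {
  if (!training || rate === 0) return activations.slice(); // inference: keep everything
  const keepScale = 1 / (1 - rate);                        // compensate for dropped units
  return activations.map((a) => (random() < rate ? 0 : a * keepScale));
}

// Deterministic stand-in for a random source, for illustration only.
const fakeRandomValues = [0.1, 0.9, 0.3, 0.7];
let cursor = 0;
const fakeRandom = () => fakeRandomValues[cursor++ % fakeRandomValues.length];

const hidden = [1, 2, 3, 4];
const trainingOutput = dropout(hidden, 0.5, true, fakeRandom); // [0, 4, 0, 8]
const inferenceOutput = dropout(hidden, 0.5, false);           // [1, 2, 3, 4]
```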

These are hyperparameters — settings you choose, not values the model learns. Searching them well is most of the day-to-day work.

05 · Architecture families 6 min

Match the structure of the network to the structure of the data.

A plain feed-forward net has no built-in notion of which inputs sit next to each other or what order they arrive in. Real data has structure — pixels near each other relate, words come in order — and these three families bake that structure into the network itself.

Convolutional Neural Networks — for grids (images)

A convolution slides a small filter (kernel) across the image, detecting the same local pattern — an edge, a texture — anywhere it appears. Pooling then shrinks the map, building up from edges to shapes to objects. Two big wins: far fewer weights than a dense layer, because the same filter is reused at every position, and (together with pooling) approximate translation invariance: a cat is a cat wherever it sits.

Use for — images, video frames, anything on a grid; also spectrograms for audio.
Names to know — ResNet, EfficientNet, U-Net (for segmentation); Vision Transformers now compete on large datasets.

A filter scans for local patterns; pooling condenses; layers stack edges into objects.
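
The filter-then-pool pipeline can be sketched on a tiny image. This assumes a "valid" convolution with stride 1 (strictly a cross-correlation, which is what deep learning libraries typically compute) and non-overlapping max pooling; the image and the edge kernel are made up.

```javascript
// "Valid" 2D convolution, stride 1: slide the kernel over every position it fits.
function conv2d(image, kernel) {
  const kh = kernel.length, kw = kernel[0].length;
  const out = [];
  for (let r = 0; r + kh <= image.length; r++) {
    const row = [];
    for (let c = 0; c + kw <= image[0].length; c++) {
      let sum = 0;
      for (let i = 0; i < kh; i++)
        for (let j = 0; j < kw; j++) sum += image[r + i][c + j] * kernel[i][j];
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

// Non-overlapping max pooling with a size x size window.
function maxPool2d(map, size) {
  const out = [];
  for (let r = 0; r + size <= map.length; r += size) {
    const row = [];
    for (let c = 0; c + size <= map[0].length; c += size) {
      let best = -Infinity;
      for (let i = 0; i < size; i++)
        for (let j = 0; j < size; j++) best = Math.max(best, map[r + i][c + j]);
      row.push(best);
    }
    out.push(row);
  }
  return out;
}

// 4x5 image: dark (0) on the left, bright (1) on the right.
const image = [
  [0, 0, 1, 1, 1],
  [0, 0, 1, 1, 1],
  [0, 0, 1, 1, 1],
  [0, 0, 1, 1, 1],
];
// 2x2 kernel that responds to a dark-to-bright vertical edge.
const edgeKernel = [[-1, 1], [-1, 1]];

const featureMap = conv2d(image, edgeKernel); // 3x4 map, nonzero only at the edge column
const pooled = maxPool2d(featureMap, 2);      // [[2, 0]]
```

The whole filter has 4 weights, reused at all 12 positions; a dense layer mapping the same 20 pixels to 12 outputs would need 240.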

Strength

Spatial patterns, few parameters, robust to where things appear.

Watch for

Needs lots of labeled images from scratch — almost always start from a pretrained backbone instead.

Recurrent Networks — for ordered sequences

An RNN reads a sequence one step at a time, carrying a hidden state — a running memory — from step to step. Plain RNNs forget long-range context (the vanishing gradient problem), so LSTM and GRU add gates that decide what to keep and what to discard, holding information across longer spans.

Use for — time series, sensor streams, and smaller or strictly streaming sequence tasks.
Reality check — for most text and long-sequence work, transformers have largely replaced RNNs since they process the whole sequence in parallel.

The same cell runs at each step, passing its hidden state forward as memory of what came before.
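
A plain RNN cell is one small function applied repeatedly. The sketch below assumes a scalar hidden state and made-up weights; real cells use vectors and matrices, and LSTM or GRU cells add gates on top of this.

```javascript
// One step of a plain RNN: mix the new input with the previous hidden state.
function rnnStep(input, hidden, params) {
  return Math.tanh(params.wInput * input + params.wHidden * hidden + params.bias);
}

function runRnn(sequence, params, initialHidden = 0) {
  let hidden = initialHidden;
  const states = [];
  for (const input of sequence) {
    hidden = rnnStep(input, hidden, params); // the SAME weights at every step
    states.push(hidden);
  }
  return states;
}

const params = { wInput: 1.0, wHidden: 0.5, bias: 0 }; // arbitrary illustration values

const withSignal = runRnn([1, 0, 0], params);    // the first input echoes through later states
const withoutSignal = runRnn([0, 0, 0], params); // nothing to remember: all states are 0
```

The trace of the first input is still present at the last step but has shrunk at every hop, a small-scale picture of why plain RNNs struggle with long-range context.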

Strength

Natural fit for streaming data and modest-length sequences.

Weakness

Sequential by nature — slow to train, and long-range memory is hard.

Transformers — attention over the whole sequence

Instead of stepping through a sequence, a transformer uses self-attention: every token looks at every other token at once and weighs which ones matter for its meaning. Because there's no recurrence, the whole sequence trains in parallel — which is what made training on internet-scale data practical.

Powers the LLMs — this is the architecture behind modern language models. Go deeper in Building LLM Apps and Fine-tuning.
Not just text — Vision Transformers, audio, and multimodal models all use the same backbone.

Each token (here "cat") weighs every other token to build its context — all in parallel.
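
Self-attention itself is a handful of dot products and a softmax. The sketch below is scaled dot-product attention with one simplification, labeled in the code: queries, keys, and values are the token vectors themselves, whereas real transformers first apply learned projections. The 2-dimensional embeddings are invented.

```javascript
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const dot = (a, b) => a.reduce((sum, ai, i) => sum + ai * b[i], 0);

// Scaled dot-product self-attention.
// Simplification: query = key = value = the token vector (no learned projections).
function selfAttention(tokens) {
  const scale = Math.sqrt(tokens[0].length);
  return tokens.map((query) => {
    // How much this token attends to every token, itself included.
    const weights = softmax(tokens.map((key) => dot(query, key) / scale));
    // New representation: attention-weighted average of all token vectors.
    const output = tokens[0].map((_, d) =>
      tokens.reduce((sum, value, t) => sum + weights[t] * value[d], 0)
    );
    return { weights, output };
  });
}

const tokens = [
  [1, 0], // hypothetical embedding for "the"
  [0, 1], // hypothetical embedding for "cat"
  [1, 1], // hypothetical embedding for "sat"
];
const attended = selfAttention(tokens);
const comparisons = attended.length * attended[0].weights.length; // 3 tokens -> 9 scores
```

Every token scores every other token, so the number of scores is the sequence length squared. Each row of scores is independent of the others, which is what allows the whole sequence to be processed in parallel.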

Strength

Long-range context, parallel training, scales with data and compute.

Cost

Attention cost (compute and memory) grows with the square of the sequence length, and it's the most data- and compute-hungry of the three.

You'll rarely build these from scratch. Pick the family that fits your data, then start from a pretrained model — the next section.

06 · Practicalities 4 min

What it actually takes: compute, data, and not starting from scratch.

The math is one thing; getting a model trained and shipped is another. Three realities shape every project — and one shortcut makes deep learning practical for normal teams.

Compute · GPUs

Why GPUs (and TPUs)

Training is mostly large matrix multiplications, and a GPU runs the thousands of independent multiply-adds inside each one in parallel, often orders of magnitude faster than a CPU. Google's TPUs are a purpose-built alternative. Mixed precision (doing most of the math in 16-bit rather than 32-bit floats) speeds things up and saves memory.

Data hunger

Data is the bottleneck

Deep nets need a lot of labeled examples, and quality matters more than cleverness. Most real projects spend more effort on collecting, cleaning, and labeling data than on the model itself. Garbage in, garbage out.

Transfer learning

Stand on a pretrained model

Start from a model already trained on huge data, then adapt it to your task with a fraction of the data and compute. This is the default workflow — training from scratch is the exception.

Transfer learning — reusing a model trained on one large task as the starting point for another. The early layers have already learned general features (edges, grammar); you keep those and retrain only what's specific to your problem. It can turn a million-example problem into a few-thousand-example one.

Keep the pretrained backbone, attach a small task-specific head, and train just that (or lightly fine-tune the rest).
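
The backbone-plus-head idea can be shown in miniature. In this toy sketch the "pretrained" backbone is just a fixed feature function with invented weights, and the head is a logistic regression trained by gradient descent on four made-up examples. It illustrates the mechanics only; a real workflow would load an actual pretrained model from a framework or hub.

```javascript
// Pretend these weights were learned elsewhere. They are frozen: never updated here.
const backbone = Object.freeze({ weights: [[1, 1], [1, -1]] });

function extractFeatures(x) {
  return backbone.weights.map((w) => Math.max(0, w[0] * x[0] + w[1] * x[1]));
}

const sigmoid = (z) => 1 / (1 + Math.exp(-z));

function predict(head, x) {
  const f = extractFeatures(x);
  return sigmoid(head.weights[0] * f[0] + head.weights[1] * f[1] + head.bias);
}

// Train ONLY the head: logistic regression on top of the frozen features.
function trainHead(examples, learningRate, epochs) {
  const head = { weights: [0, 0], bias: 0 };
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (const { x, label } of examples) {
      const features = extractFeatures(x);
      const error = predict(head, x) - label; // cross-entropy gradient w.r.t. the score
      head.weights = head.weights.map((w, i) => w - learningRate * error * features[i]);
      head.bias -= learningRate * error;
    }
  }
  return head;
}

// Tiny made-up task: label is 1 when x[0] > x[1].
const examples = [
  { x: [2, 0], label: 1 },
  { x: [3, 1], label: 1 },
  { x: [0, 2], label: 0 },
  { x: [1, 3], label: 0 },
];
const head = trainHead(examples, 0.1, 500);
const trainingPredictions = examples.map(({ x }) => predict(head, x));
```

Only three numbers (two head weights and a bias) are learned; everything the backbone already knows is reused as-is.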

Where to get pretrained models

Hugging Face Hub — the de-facto registry for pretrained models across text, vision, and audio.
Framework model zoos — torchvision, Keras Applications, and TensorFlow Hub ship common backbones with weights.
For LLMs specifically — fine-tuning with adapters like LoRA is its own discipline; see Fine-tuning.
Once it works, shipping and monitoring it is MLOps — that track covers serving, versioning, and drift.

07 · Tooling, trade-offs & recap 4 min

Three frameworks, one mental model — pick for your job, not the hype.

You define the network; the framework computes gradients automatically (autodiff) and runs the computation on the GPU. The differences are about ergonomics, deployment, and scale.

PyTorch

The research & default standard

Pythonic, eager-by-default (you debug it like normal code), and the dominant choice in research with the largest model ecosystem. torch.compile closes much of the old speed gap.

Pro — easiest to learn and debug; huge community and pretrained models.

Con — production serving and mobile need extra tooling.

Choose when you're learning, experimenting, or doing most research and applied work — the safe default.

TensorFlow / Keras

Production & on-device

Keras is the high-level API anyone can read; Keras 3 even runs on PyTorch or JAX backends. TensorFlow brings mature serving (TF Serving) and on-device deployment (LiteRT, formerly TF Lite).

Pro — battle-tested deployment path to servers, web, and mobile.

Con — smaller research mindshare today; more API surface area.

Choose when you need a clean high-level API (Keras) or a proven road to production and edge devices.

JAX

Performance at scale

A functional, NumPy-like core with composable transforms — grad, jit, vmap — compiled via XLA. Paired with libraries like Flax and Optax; a favorite for large-scale training, especially on TPUs.

Pro — top-tier speed and clean scaling across many accelerators.

Con — steepest learning curve; functional style and more boilerplate.

Choose when you're pushing performance or training big models on TPUs and want fine control.

How to choose — the short version

Default to PyTorch. It has the gentlest learning curve and the most examples, tutorials, and pretrained weights to copy from.
Reach for Keras/TensorFlow when you want a simpler high-level API or a well-trodden deployment-to-edge story.
Reach for JAX when raw performance and large-scale TPU training matter more than ergonomics.
And first ask if you need deep learning at all — for tabular data and small datasets, a gradient-boosted tree from classical ML is faster, cheaper, and often more accurate.

Don't train from scratch by default. Between pretrained models on the Hugging Face Hub and hosted APIs, you can often solve the problem without owning a training pipeline at all. Build the custom model only when off-the-shelf genuinely can't do the job.

Five things to walk out with

1. Deep learning learns features. That's the trade vs. classical ML — less hand-engineering, more data and compute.

2. It's weighted sums + activations, stacked into layers and trained by gradient descent on a loss.

3. The enemy is overfitting. Watch validation loss; lean on more data, early stopping, regularization, and dropout.

4. Match architecture to data — CNN for grids, RNN/LSTM for streams, transformers for sequences and the LLMs built on them.

5. Start pretrained, simplest tool first. Often the right move is classical ML or a fine-tuned existing model — not a new net.

