GUIDEDECK · PART 2 · for engineers who already test

Advanced
Testing
past the green checkmark.

Part 2 of Testing & TDD. The intro gave you the pyramid, the TDD loop, and the five test doubles. This session goes deep: generators that hunt edge cases, contracts between services, testing your tests, and the dark arts of time, concurrency, and the things you don't own.

~36 MIN · INTERMEDIATE → ADVANCED · LANGUAGE-AGNOSTIC

01 · Property-based testing 6 min

Stop guessing inputs.
Generate them by the thousand.

This is Part 2. It builds straight on the intro deck, so we assume you're fluent with example-based tests and the arrange-act-assert shape. The leap here: instead of asserting an output for one hand-picked input, you state a property that must hold for all inputs, then let the framework manufacture hundreds of them, including the cruel ones you'd never think to type.

Property-based testing (PBT): assert a universal rule, not a specific output. You describe a generator (the shape of valid input) and a property (a boolean that must stay true for every value the generator produces). The runner samples, typically 100–1000 cases per run, and the moment one fails it shrinks it to the smallest reproducing example. Origin: Haskell's QuickCheck (2000); today fast-check, Hypothesis, and jqwik carry the idea into JS, Python, and the JVM.
[Figure: example-based testing (3 inputs you thought of: [], [1], [1,2,3]) vs. property-based testing (a generator producing thousands: [], [NaN], [-0], [MAX_INT]…). SHRINK: [5,-3,9,2,-7] ✕ → [-3,9] ✕ → [-3] ✕ → [-1], the minimal case.]

Example tests probe where you already looked. Generators probe where you didn't — then shrink the failure to a one-line repro.

Why it finds bugs you can't

  • The generator explores the edges on purpose — empty, zero, negative, Unicode, huge, duplicate, NaN.
  • A failure isn't a vague stack trace — shrinking hands you the smallest input that breaks the property.
  • A failing run reports how to reproduce it (in fast-check, a seed); feed that back to replay the exact failing sequence deterministically.
  • It pushes you to articulate what must always be true — often the most valuable part.
    // fast-check (TypeScript)
    // orderArb (an arbitrary for orders), encode and decode are assumed to be defined elsewhere
    import fc from "fast-check"

    test("decode(encode(x)) === x for ALL orders", () => {
      fc.assert(
        fc.property(orderArb, (order) => {
          const back = decode(encode(order))
          expect(back).toEqual(order) // the invariant
        }),
        { numRuns: 500, seed: 42 } // fixed seed makes the run reproducible
      )
    })
[Figure: generate order → run property → pass ✓, or fail ✕ → shrink]

Generate → run → on failure, shrink toward the minimal counterexample, then report it with its seed.
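
To make the loop concrete, here is a minimal, dependency-free sketch of generate, run, shrink. It is a toy, not how fast-check or Hypothesis work internally: lcg, genIntArray, shrinkCandidates and checkProperty are hypothetical names, and the property is deliberately false so there is something to shrink.

```typescript
// Toy generate -> run -> shrink loop. Not fast-check's real algorithm.

// Seeded linear congruential generator: the same seed replays the same inputs.
function lcg(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // in [0, 1)
  };
}

// Generator: arrays of 0..8 integers, each in [-10, 10].
function genIntArray(rand: () => number): number[] {
  const length = Math.floor(rand() * 9);
  const xs: number[] = [];
  for (let i = 0; i < length; i++) xs.push(Math.floor(rand() * 21) - 10);
  return xs;
}

// Simpler variants of a failing input: drop one element, or nudge one toward zero.
function shrinkCandidates(xs: number[]): number[][] {
  const out: number[][] = [];
  xs.forEach((x, i) => {
    out.push(xs.slice(0, i).concat(xs.slice(i + 1)));
    if (x !== 0) {
      const closer = xs.slice();
      closer[i] = x > 0 ? x - 1 : x + 1;
      out.push(closer);
    }
  });
  return out;
}

type Report = { ok: true } | { ok: false; seed: number; counterexample: number[] };

function checkProperty(
  property: (xs: number[]) => boolean,
  opts: { numRuns: number; seed: number }
): Report {
  const rand = lcg(opts.seed);
  for (let run = 0; run < opts.numRuns; run++) {
    let failing = genIntArray(rand);
    if (property(failing)) continue;
    // Greedy shrink: keep taking any simpler candidate that still fails.
    let shrunk = true;
    while (shrunk) {
      shrunk = false;
      for (const candidate of shrinkCandidates(failing)) {
        if (!property(candidate)) {
          failing = candidate;
          shrunk = true;
          break;
        }
      }
    }
    return { ok: false, seed: opts.seed, counterexample: failing };
  }
  return { ok: true };
}

// Deliberately false property: "no array contains a negative number".
const report: Report = checkProperty((xs) => xs.every((x) => x >= 0), { numRuns: 500, seed: 42 });
console.log(report);
```

With this greedy shrinker the reported counterexample is [-1]: dropping the element or moving it to 0 makes the property pass again, so nothing simpler fails.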

The hard part isn't the tool — it's naming a property

You rarely have an oracle that recomputes the right answer. These five archetypes are how experienced testers find properties without one.

Round-trip (there-and-back)
decode(encode(x)) === x

Anything that serializes, parses, compresses, or encrypts should come back unchanged. The single most productive property to write — it catches lossy edge cases (trailing whitespace, Unicode normalization, -0) for free.
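
A dependency-free sketch of the round-trip check, assuming plain JSON as the codec. encodeJson, decodeJson and roundTrips are illustrative names, and a fixed list of edge values stands in for a generator.

```typescript
// Round-trip property: decode(encode(x)) must give back exactly x.
const encodeJson = (x: number): string => JSON.stringify(x);
const decodeJson = (s: string): unknown => JSON.parse(s);

// Object.is tells -0 from 0 and treats NaN as equal to itself.
const roundTrips = (x: number): boolean => Object.is(decodeJson(encodeJson(x)), x);

// A few of the values a generator would throw at the property.
const edgeValues: number[] = [0, 1, -1, 0.1, 1e21, Number.MAX_SAFE_INTEGER, -0, NaN, Infinity];
const lossy = edgeValues.filter((x) => !roundTrips(x));
console.log(lossy); // the inputs JSON cannot carry there and back
```

JSON turns -0 into 0 and both NaN and Infinity into null, so those three land in lossy. Whether that is a bug depends on what your format promises.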

Invariant (something always holds)
a truth about the output regardless of input

After sort(xs): the result is the same length, is a permutation of the input, and each element ≤ the next. You assert the shape of correctness without recomputing the answer.
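
The three sort invariants can be sketched as one checker. isSortedPermutation, numericSort and defaultSort are hypothetical names; the second sort is included only to show the invariant catching a real mistake.

```typescript
// Invariants of sorting, checked without recomputing "the right answer".
function isSortedPermutation(input: number[], output: number[]): boolean {
  if (output.length !== input.length) return false; // same length
  for (let i = 1; i < output.length; i++) {
    if (output[i - 1] > output[i]) return false; // each element <= the next
  }
  // Permutation: every value occurs equally often in both arrays.
  const counts = new Map<number, number>();
  input.forEach((x) => counts.set(x, (counts.get(x) ?? 0) + 1));
  for (const x of output) {
    const left = (counts.get(x) ?? 0) - 1;
    if (left < 0) return false;
    counts.set(x, left);
  }
  return true;
}

const numericSort = (xs: number[]): number[] => xs.slice().sort((a, b) => a - b);
// Array.prototype.sort without a comparator compares elements as strings.
const defaultSort = (xs: number[]): number[] => xs.slice().sort();

const sample = [10, 9, 1, -3, 9];
console.log(isSortedPermutation(sample, numericSort(sample))); // true
console.log(isSortedPermutation(sample, defaultSort(sample))); // false: "10" sorts before "9"
```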

Idempotence
f(f(x)) === f(x)

Normalizing a path, trimming a string, applying a migration twice — doing it again should change nothing. A classic source of real-world bugs in "retry-safe" code.
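
A small sketch of the idempotence check. normalizePath and stripOneSlash are hypothetical helpers written for this example, the second one deliberately not idempotent.

```typescript
// Collapse repeated slashes and drop a trailing slash (except for the root "/").
function normalizePath(path: string): string {
  const collapsed = path.replace(/\/+/g, "/");
  return collapsed.length > 1 && collapsed.endsWith("/") ? collapsed.slice(0, -1) : collapsed;
}

// Looks similar, but removes only ONE trailing slash per call.
const stripOneSlash = (p: string): string => (p.endsWith("/") ? p.slice(0, -1) : p);

// Idempotence: applying f twice equals applying it once.
const isIdempotentOn = (f: (p: string) => string, p: string): boolean => f(f(p)) === f(p);

const samplePaths = ["", "/", "//", "a//b/", "/a/b///", "///a"];
console.log(samplePaths.every((p) => isIdempotentOn(normalizePath, p))); // true
console.log(isIdempotentOn(stripOneSlash, "a//")); // false: "a/" after one call, "a" after two
```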

Oracle / model-based
compare against a simpler, obviously-correct version

Test the fast implementation against a slow, naïve one you trust, or against the old code during a refactor. A reference library can play the same role. Algebraic laws such as commutativity (add(a,b) === add(b,a)) are a close cousin that needs no second implementation.

Metamorphic relation
how the output changes when you tweak the input

When there's no oracle at all: assert a relationship. Searching for "cat" should return at least as many hits as "black cat". Scaling every price by 2 should double the cart total. Powerful for ML and search.
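
The search relation can be sketched in a few lines. searchDocs is a hypothetical AND-matching search written for this example; the relation only holds for engines where extra terms narrow the result.

```typescript
const corpus = ["black cat on a mat", "a cat nap", "black dog", "the cat sat"];

// Hypothetical search: a document matches when it contains every query word.
function searchDocs(query: string, docs: string[]): string[] {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  return docs.filter((doc) => {
    const words = doc.toLowerCase().split(/\s+/);
    return terms.every((t) => words.indexOf(t) !== -1);
  });
}

const broad = searchDocs("cat", corpus);
const narrow = searchDocs("black cat", corpus);

// We never state what the right hits are, only how the two result sets relate.
const relationHolds =
  narrow.length <= broad.length && narrow.every((d) => broad.indexOf(d) !== -1);
console.log(broad.length, narrow.length, relationHolds); // 3 1 true
```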

Go further · stateful PBT

Generate whole sequences of operations

Model-based / stateful testing generates random sequences of commands against your system and a tiny in-memory model, asserting they agree at every step — fast-check's fc.commands, Hypothesis' RuleBasedStateMachine. It finds ordering and concurrency bugs a single-call property never would.
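
A hand-rolled sketch of the idea, without fc.commands: random enqueue/dequeue commands run against a ring buffer and a plain-array model, comparing after every step. RingQueue, seededRandom and firstDivergence are hypothetical names.

```typescript
// System under test: a fixed-capacity ring buffer.
class RingQueue {
  private items: number[] = [];
  private head = 0;
  private count = 0;
  private capacity: number;
  constructor(capacity: number) {
    this.capacity = capacity;
    for (let i = 0; i < capacity; i++) this.items.push(0);
  }
  enqueue(v: number): boolean {
    if (this.count === this.capacity) return false; // full: reject
    this.items[(this.head + this.count) % this.capacity] = v;
    this.count++;
    return true;
  }
  dequeue(): number | undefined {
    if (this.count === 0) return undefined;
    const v = this.items[this.head];
    this.head = (this.head + 1) % this.capacity;
    this.count--;
    return v;
  }
  size(): number {
    return this.count;
  }
}

// Seeded generator so a failing command sequence can be replayed.
function seededRandom(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Returns the index of the first step where queue and model disagree, or -1.
function firstDivergence(seed: number, steps: number, capacity: number): number {
  const rand = seededRandom(seed);
  const sut = new RingQueue(capacity);
  const model: number[] = []; // the "obviously correct" model
  for (let step = 0; step < steps; step++) {
    if (rand() < 0.6) {
      const v = Math.floor(rand() * 100);
      const expected = model.length < capacity;
      if (expected) model.push(v);
      if (sut.enqueue(v) !== expected) return step;
    } else if (sut.dequeue() !== model.shift()) {
      return step;
    }
    if (sut.size() !== model.length) return step;
  }
  return -1;
}

console.log(firstDivergence(42, 200, 4)); // -1 when the queue agrees with the model throughout
```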

02 · Contract testing 5 min

Catch a broken integration
without spinning up both services.

Two services agree on an API. Service B changes a field; Service A breaks in production — and no unit test caught it, because each side mocked the other with its own assumptions. Contract testing pins those assumptions to a shared, verifiable artifact.

Consumer-driven contract (CDC) testing: the consumer records exactly what it needs; the provider proves it still delivers. The consumer writes a test against a mock and emits a contract (a pact file: the requests it makes and the responses it relies on). The provider replays that contract against its real code in its own pipeline. No shared environment, no end-to-end run: just two fast, independent test suites that can't silently diverge.
[Figure: the Consumer test (Web app) publishes pact.json, the contract, to the Pact Broker, which shares and versions it; Provider verify (Orders API, real code) pulls and verifies it.]

The consumer's expectations become a file the provider must honor — checked in two separate pipelines, never live together.

Where it sits in the pyramid

  • It replaces most cross-service integration tests — the slow, flaky band the intro deck warned about.
  • A broken contract fails at build time, in the pipeline of the team that broke it — not at 2 a.m. in prod.
  • It verifies the shape and meaning the consumer actually uses, not the provider's full surface area.
  • can-i-deploy queries the broker: "is every consumer of this version still compatible?" before you ship.
    // Pact v3 consumer test (the WEB side)
    import { PactV3, MatchersV3 } from "@pact-foundation/pact"
    const { integer, decimal } = MatchersV3

    const pact = new PactV3({ consumer: "Web", provider: "Orders" })

    pact
      .given("order 42 exists") // provider state
      .uponReceiving("a request for order 42")
      .withRequest({ method: "GET", path: "/orders/42" })
      .willRespondWith({
        status: 200,
        body: { id: integer(42), total: decimal(9.9) },
      })

    // run inside an async test; fetchOrder is the consumer's own client code
    await pact.executeTest(async (mock) => {
      const o = await fetchOrder(mock.url, 42)
      expect(o.id).toBe(42) // drives the contract
    })
[Figure: the matchers integer(42) and decimal(9.9) match on TYPE: { id: 99, total: 5.0 } passes ✓; { id: "x", total: null } breaks ✕]

Match on type, not exact value — the contract pins the shape the consumer depends on, and tolerates new data it ignores.
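
What "match on type" means can be sketched without Pact. This is a simplified stand-in, not Pact's matching engine: integerLike, decimalLike and violations are hypothetical names, and decimalLike accepts any finite number here, which may be looser than the real rule.

```typescript
type Matcher = { kind: "integer" | "decimal"; example: number };
const integerLike = (example: number): Matcher => ({ kind: "integer", example });
const decimalLike = (example: number): Matcher => ({ kind: "decimal", example });

type Contract = { [field: string]: Matcher };

// Returns the contract fields the body violates. Fields the contract does not
// mention are ignored, so the provider can add data freely.
function violations(contract: Contract, body: { [field: string]: unknown }): string[] {
  return Object.keys(contract).filter((field) => {
    const value = body[field];
    if (typeof value !== "number" || !isFinite(value)) return true; // wrong type
    return contract[field].kind === "integer" && Math.floor(value) !== value;
  });
}

const orderContract: Contract = { id: integerLike(42), total: decimalLike(9.9) };

console.log(violations(orderContract, { id: 99, total: 5.0, currency: "EUR" })); // []
console.log(violations(orderContract, { id: "x", total: null })); // ["id", "total"]
```

The example values (42, 9.9) never take part in the comparison; they only document a plausible response and feed the consumer's mock.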

Provider state

given("order 42 exists")

Each interaction names a precondition. On the provider side you wire that state name to a fixture (seed the row, stub the dep) so verification is deterministic.

Broker · can-i-deploy

The compatibility matrix

The broker (PactFlow or self-hosted) tracks which consumer and provider versions verified against each other. can-i-deploy gates the release on a green matrix.
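
The gate can be pictured as a lookup over recorded verification results. This sketch is an assumption about the idea, not the broker's actual logic (which also handles environments, tags and more); Verification, canDeployProvider and the version numbers are made up for illustration.

```typescript
// One row per verification result the broker has recorded.
type Verification = {
  consumer: string;
  consumerVersion: string;
  providerVersion: string;
  success: boolean;
};

// A provider version is safe when every consumer version currently in
// production has a successful verification against it.
function canDeployProvider(
  providerVersion: string,
  inProduction: { [consumer: string]: string },
  matrix: Verification[]
): boolean {
  return Object.keys(inProduction).every((consumer) =>
    matrix.some(
      (row) =>
        row.consumer === consumer &&
        row.consumerVersion === inProduction[consumer] &&
        row.providerVersion === providerVersion &&
        row.success
    )
  );
}

const verificationMatrix: Verification[] = [
  { consumer: "Web", consumerVersion: "1.4", providerVersion: "2.0", success: true },
  { consumer: "Mobile", consumerVersion: "3.1", providerVersion: "2.0", success: true },
  { consumer: "Web", consumerVersion: "1.4", providerVersion: "2.1", success: true },
  { consumer: "Mobile", consumerVersion: "3.1", providerVersion: "2.1", success: false },
];
const liveConsumers = { Web: "1.4", Mobile: "3.1" };

console.log(canDeployProvider("2.0", liveConsumers, verificationMatrix)); // true
console.log(canDeployProvider("2.1", liveConsumers, verificationMatrix)); // false: Mobile 3.1 failed
```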

Not just HTTP

Async & schema-first

Pact covers message contracts (Kafka, queues) too. For schema-only guarantees, OpenAPI or Protobuf with buf breaking-change checks complement Pact: they verify shape, not consumer intent.

03 · Mutation testing 5 min

100% coverage,
zero assertions. Who's testing the tests?

Coverage tells you a line ran. It says nothing about whether your test would notice if that line were wrong. Mutation testing measures exactly that — by deliberately breaking your code and checking whether your suite screams.

Mutation testing: grade your tests by breaking your code. The tool generates mutants: tiny single-edit variants of your source (< becomes <=, + becomes -, a return true flips, a statement is deleted). It runs your suite against each one. If a test fails, the mutant is killed (good: your tests caught the bug). If every test still passes, the mutant survived, which marks a real assertion gap.
[Figure: mutants of the original if (age >= 18): age > 18 → killed ✓; age <= 18 → killed ✓; age >= 17 → survived ✕; true → equivalent?]

age >= 17 survives → no test exercises the boundary at 17. That survivor is a precise instruction for the test to write next.

The mutation score

Mutation score = killed / (total − equivalent). A survivor is a place your code could be wrong and no test would tell you. It is a far stronger signal than line coverage — and far slower to compute.
  • Killed — a test failed on the mutant. What you want.
  • Survived — all tests passed. A genuine gap: a missing assertion or an untested boundary.
  • No coverage — the line never ran. A faster tool (coverage) already told you that.
  • Timeout — the mutant caused an infinite loop; counted as killed.
Equivalent mutant: a mutation that changes the code but not its behavior. i < n vs i != n in a loop that starts at or below n and only ever increments by one produces identical results, so no test can kill it. Deciding whether a mutant is equivalent is undecidable in general, so tools cannot subtract equivalents from the denominator automatically; in practice they linger there as unkillable noise, which is why 100% is the wrong target. Chase survivors that represent real behavior, not the last few percent.
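
The killed/survived bookkeeping can be sketched by hand for the age >= 18 check. A real tool generates the mutants itself; here they are written out, and AdultCheck, survivors and score are hypothetical names. None of these three mutants is equivalent, so the denominator is simply the total.

```typescript
type AdultCheck = (age: number) => boolean;
const originalCheck: AdultCheck = (age) => age >= 18;

// Hand-written mutants of `age >= 18`.
const mutants: { label: string; impl: AdultCheck }[] = [
  { label: "age > 18", impl: (age) => age > 18 },
  { label: "age <= 18", impl: (age) => age <= 18 },
  { label: "age >= 17", impl: (age) => age >= 17 },
];

// A "suite" is a list of input/expected pairs; it passes if every pair holds.
type Case = { age: number; expected: boolean };
const suitePasses = (impl: AdultCheck, suite: Case[]): boolean =>
  suite.every((c) => impl(c.age) === c.expected);

// A mutant survives when the whole suite still passes against it.
const survivors = (suite: Case[]): string[] =>
  mutants.filter((m) => suitePasses(m.impl, suite)).map((m) => m.label);

const score = (suite: Case[]): number =>
  (mutants.length - survivors(suite).length) / mutants.length;

const weakSuite: Case[] = [
  { age: 18, expected: true },
  { age: 30, expected: true },
  { age: 5, expected: false },
];
// The survivor names the missing test: the value just below the boundary.
const strongSuite: Case[] = weakSuite.concat([{ age: 17, expected: false }]);

console.log(survivors(weakSuite), score(weakSuite)); // ["age >= 17"], 2/3
console.log(survivors(strongSuite), score(strongSuite)); // [], 1
```
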
High coverage, weak test

    // runs every line, asserts almost nothing
    test("applies discount", () => {
      const total = price(100, { vip: true })
      expect(total).toBeDefined() // mutants survive freely
    })

Same coverage, kills mutants

    // pins the value AND the boundary
    test("VIP gets 20% off", () => {
      expect(price(100, { vip: true })).toBe(80)
      expect(price(100, { vip: false })).toBe(100)
    })

It's slow: every mutant reruns (a slice of) your suite, so a full run can take hours. Make it practical: run it on changed files only in CI (Stryker's incremental mode, PIT's history), set a sane threshold (break the build below, say, 60% on touched code), and treat the survivor list as a to-do, not a vanity metric.

04 · Test architecture 5 min

A test suite is a codebase.
Design its data like one.

Most suite rot isn't bad assertions — it's unmanaged test data and hidden coupling. A 40-line setup block, a shared mutable fixture, an order-dependent pass: these are the things that make a green suite untrustworthy. Architect them deliberately.

Test data builder: a tiny helper that constructs a valid object with sane defaults and lets each test override only what it cares about. It replaces brittle, repeated literals so a test reads as "a user, but banned", making the one relevant variable obvious. Its cousins: the Object Mother (named canned instances: aVipCustomer()) and factories (Factory Boy, fishery, FactoryBot) that also persist to a DB.
    // defaults make every test start from "valid"
    // (User, randomId, clock and canCheckout are assumed to exist in the codebase)
    const aUser = (over: Partial<User> = {}): User => ({
      id: randomId(),
      name: "Test User",
      status: "active",
      createdAt: clock.now(),
      ...over, // override ONLY what matters
    })

    // the test states its one relevant fact, nothing else
    const banned = aUser({ status: "banned" })
    expect(canCheckout(banned)).toBe(false)
[Figure: aUser() supplies valid defaults (active); overrides produce the banned and vip variants.]

One factory, many variants. The test names only the field under test — everything else is a trusted default.

Fixtures vs factories — know the difference

Fixture

A prepared environment

Scoped setup/teardown: a temp DB, a logged-in client, a seeded row. pytest's @pytest.fixture with function/module/session scope is the model: declare a dependency, and the runner wires it in and tears it down. Risk: a session-scoped fixture mutated by one test leaks into the next.

Factory

A builder that makes data

Produces fresh, independent instances on demand — optionally persisted. Prefer factories over shared fixtures for entities: each test owns its data, so there's nothing to leak. Reserve fixtures for expensive infrastructure (the container, the connection pool).
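
The leak is easy to reproduce in miniature. In this sketch two "tests" each withdraw 30 from an account; Account, sharedAccount, anAccount and withdraw30 are hypothetical names.

```typescript
type Account = { id: number; balance: number };

// Shared fixture: one object reused by every test.
const sharedAccount: Account = { id: 1, balance: 100 };

// Factory: a fresh object per call, with per-test overrides.
let nextAccountId = 1;
const anAccount = (over: Partial<Account> = {}): Account => ({
  id: nextAccountId++,
  balance: 100,
  ...over,
});

// Each "test" withdraws 30 and reports the balance it observed.
const withdraw30 = (acct: Account): number => (acct.balance -= 30);

// Shared fixture: the second test sees the first test's mutation.
const sharedRuns = [withdraw30(sharedAccount), withdraw30(sharedAccount)];
// Factory: each test owns its data, so the order of tests cannot matter.
const factoryRuns = [withdraw30(anAccount()), withdraw30(anAccount())];

console.log(sharedRuns); // [70, 40]
console.log(factoryRuns); // [70, 70]
```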

Flakiness is a design problem, not a retry problem

[Figure: a flaky test ("passes... sometimes") and its five sources: shared state, real time / clock, test order, network / async, concurrency.]

Every flake traces to non-determinism. Remove the source; don't paper over it with retry(3).

The discipline

  • Isolation — each test creates and destroys its own data; no order dependence. Randomize order (pytest-randomly) to expose hidden coupling.
  • Determinism — pin every seed (random, PBT, faker), inject the clock, freeze IDs. Same input, same result, every run.
  • Quarantine, don't ignore: move a known-flaky test to a tagged lane that doesn't block the build, file a ticket, and fix the root cause. A retry that "fixes" flakiness is hiding a real race.

05 · Testing the hard stuff 6 min

Time, concurrency, and the
services you don't own.

The intro deck introduced the five doubles. Now the hard cases: code that depends on the wall clock, on race conditions, on a third-party API. The unifying move is the same every time — turn the uncontrollable thing into an injected dependency you command from the test.

Time

Never read the clock directly

Date.now() / datetime.now() sprinkled in business logic makes "expires in 30 days" untestable. Inject a Clock and advance it by hand, or freeze it with fake timers (vi.useFakeTimers, sinon), freezegun (Python), or a fixed java.time.Clock.

Async / concurrency

Wait on conditions, not on sleep

sleep(500) is a classic source of flakiness: too short and it races, too long and it crawls. Await a condition (the element, the event, the queue depth). For races, use a deterministic scheduler or controllable fake timers so you decide interleavings.
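
What "controllable fake timers" buys you can be sketched with a hand-rolled stand-in. FakeTimers here is a toy written for this example, not the vi.useFakeTimers or sinon implementation: virtual time moves only when the test says so, and nothing actually sleeps.

```typescript
class FakeTimers {
  private now = 0;
  private queue: { at: number; run: () => void }[] = [];

  setTimeout(run: () => void, delayMs: number): void {
    this.queue.push({ at: this.now + delayMs, run });
  }

  // Advance virtual time, firing due callbacks in timestamp order.
  advance(ms: number): void {
    const target = this.now + ms;
    for (;;) {
      const due = this.queue.filter((t) => t.at <= target).sort((a, b) => a.at - b.at);
      if (due.length === 0) break;
      const next = due[0];
      this.queue.splice(this.queue.indexOf(next), 1);
      this.now = next.at;
      next.run();
    }
    this.now = target;
  }
}

const timers = new FakeTimers();
const events: string[] = [];
timers.setTimeout(() => events.push("retry"), 500);
timers.setTimeout(() => events.push("timeout"), 300);

timers.advance(299);
const firedBefore300 = events.length; // 0: nothing is due yet
timers.advance(1); // "timeout" fires at exactly 300 ms
timers.advance(200); // "retry" fires at 500 ms
console.log(firedBefore300, events); // 0 ["timeout", "retry"]
```

The test pins the interleaving exactly (299 ms, then 1 ms more), which is the part a real sleep can never guarantee.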

External services

Own your boundary

Wrap the third-party SDK behind your interface, then test against a fake of that interface. Plus real-ish options: Testcontainers (real Postgres/Kafka in Docker), WireMock / MSW for HTTP, record-replay cassettes (VCR).

    // the clock is a dependency, not a global
    interface Clock { now(): Date }
    interface Token { expiresAt: Date }

    class TokenValidator {
      constructor(private clock: Clock) {}
      isExpired(t: Token) { return t.expiresAt < this.clock.now() }
    }

    // test: a clock you control, no real waiting
    const clock: Clock = { now: () => new Date("2030-01-01") }
    const old: Token = { expiresAt: new Date("2029-12-31") } // expired the day before
    expect(new TokenValidator(clock).isExpired(old)).toBe(true)
[Figure: the token logic depends on a Clock interface, implemented by SystemClock (prod) and FixedClock (test).]

Inject the clock and the "expires" logic is just data — real clock in prod, frozen clock in the test.

Hermetic test: a test that depends on nothing outside its own process (no network, no real clock, no shared DB, no ambient state). Hermetic tests are fast, parallel-safe, and far less prone to flakiness. Everything in this section is in service of making more of your tests hermetic.

Fakes vs mocks — the deep version

Mock: couples to the interaction

    // asserts HOW the code calls the dep
    const repo = { save: jest.fn() }
    register(repo, user)
    expect(repo.save).toHaveBeenCalledWith(user)
    // rename/reorder the call → test breaks,
    // even though behavior is identical

Fake: couples to the behavior

    // a real, in-memory implementation
    const repo = new InMemoryUserRepo()
    register(repo, user)
    expect(repo.findByEmail(user.email)).toEqual(user)
    // asserts the OUTCOME: refactor-proof
[Figure: one shared Repo contract test suite runs against InMemoryRepo (fast, used in unit tests) and PostgresRepo (real, via Testcontainers); the same tests must pass on BOTH ✓]

Run the same contract suite against the fake and the real thing. If both pass, your fake is faithful — and fast unit tests stay trustworthy.
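
A minimal sketch of such a shared suite. JsonRowUserRepo is only a stand-in for the real adapter (a Postgres-backed repo running under Testcontainers would take its place), and userRepoContract is a hypothetical name.

```typescript
type User = { email: string; name: string };

interface UserRepo {
  save(user: User): void;
  findByEmail(email: string): User | undefined;
}

// Fast in-memory fake, used by unit tests.
class InMemoryUserRepo implements UserRepo {
  private byEmail = new Map<string, User>();
  save(user: User): void {
    this.byEmail.set(user.email, { ...user });
  }
  findByEmail(email: string): User | undefined {
    return this.byEmail.get(email);
  }
}

// Stand-in for the real adapter: stores serialized rows instead of objects.
class JsonRowUserRepo implements UserRepo {
  private rows: string[] = [];
  save(user: User): void {
    this.rows = this.rows.filter((r) => (JSON.parse(r) as User).email !== user.email);
    this.rows.push(JSON.stringify(user));
  }
  findByEmail(email: string): User | undefined {
    return this.rows.map((r) => JSON.parse(r) as User).filter((u) => u.email === email)[0];
  }
}

// The contract: one suite, run against every implementation.
// Returns the list of failed expectations (empty means the contract holds).
function userRepoContract(makeRepo: () => UserRepo): string[] {
  const failures: string[] = [];
  const repo = makeRepo();
  if (repo.findByEmail("a@b.co") !== undefined) failures.push("unknown email should be undefined");
  repo.save({ email: "a@b.co", name: "Ann" });
  const found = repo.findByEmail("a@b.co");
  if (found === undefined || found.name !== "Ann") failures.push("saved user should be found");
  repo.save({ email: "a@b.co", name: "Anna" });
  const updated = repo.findByEmail("a@b.co");
  if (updated === undefined || updated.name !== "Anna") failures.push("save should overwrite by email");
  return failures;
}

console.log(userRepoContract(() => new InMemoryUserRepo())); // []
console.log(userRepoContract(() => new JsonRowUserRepo())); // []
```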

Two rules that keep fakes honest, plus one for mocks

  • Don't mock what you don't own. Mocking a third-party SDK bakes your guess about its behavior into the test. Wrap it behind your own port and fake that.
  • Verify the fake against reality. A fake is only useful if it behaves like the real one — pin that with a shared contract test run against both (this is the same idea as Section 2, applied in-process).
  • Use mocks sparingly: for verifying an interaction that is the behavior (an email was sent), never as a default.

06 · The advanced tooling 4 min

The kit for everything
in this deck.

One representative tool per technique, per ecosystem — with the honest trade-off and the moment to reach for it. All are mature and actively maintained as of 2026.

JS / TS · fast-check

fast-check

Pro: first-class TS types, excellent shrinking, integrates with Jest/Vitest, stateful fc.commands.
Con: writing good generators for complex domains takes practice.

Reach for it on any JS/TS pure logic: parsers, pricing, encoders.

Python · Hypothesis

Hypothesis

Pro: the gold standard — rich strategies, a saved example database that replays past failures, RuleBasedStateMachine.
Con: generation can be slow; tune deadlines via settings.

Reach for it as the default for any Python logic worth testing hard.

JVM · jqwik

jqwik

Pro: a JUnit 5 engine, so it drops into existing Java/Kotlin suites; integrated shrinking.
Con: smaller community; more boilerplate than Hypothesis.

Reach for it when you're already on JUnit 5 and want PBT alongside.

Polyglot · Pact

Pact

Pro: the de-facto consumer-driven standard; clients in every major language; HTTP + messaging; a broker with can-i-deploy.
Con: a real learning curve; provider states and broker ops add moving parts.

Reach for it when independent teams ship services on separate cadences.

JVM · Spring Cloud Contract

Spring Cloud Contract

Pro: producer-driven, deeply integrated with Spring; generates stubs for consumers.
Con: Spring-centric; awkward outside the JVM ecosystem.

Reach for it in an all-Spring estate where the provider leads.

Schema · OpenAPI / buf

Schema checks

Pro: cheap breaking-change detection on REST/gRPC; no consumer test needed.
Con: verifies shape, not what a consumer actually relies on.

Reach for it as a fast first line; pair with Pact for intent.

JS / TS · Stryker

Stryker

Pro: mature, great HTML reports, incremental mode for CI, also .NET & Scala flavors.
Con: slow on big suites without scoping to changed files.

Reach for it to audit a critical module's tests, on the diff in CI.

JVM · PIT

PIT (pitest)

Pro: fast (bytecode mutation), the JVM standard, history-driven incremental runs.
Con: Java-centric config; report tuning takes effort.

Reach for it as the default mutation tool on any JVM project.

Python · mutmut

mutmut / cosmic-ray

Pro: simple to drop in; clear survivor diffs to act on.
Con: notably slow; best scoped to a package, not the whole repo.

Reach for it to harden the tests around one high-stakes Python module.

E2E · Playwright

Playwright

Pro: cross-browser, auto-waiting (kills most flake), traces & video, parallel sharding, first-class API + component testing.
Con: still the slowest, most expensive band — keep it thin.

Reach for it for a handful of critical user-journey smoke tests — not feature coverage.

Infra · Testcontainers

Testcontainers

Pro: a real Postgres / Kafka / Redis in Docker, disposable per run — high fidelity, no shared env.
Con: needs Docker; slower than an in-memory fake.

Reach for it to verify the real adapter (and your fake) against the genuine engine.

HTTP · WireMock / MSW

WireMock · MSW

Pro: stub external HTTP at the network edge; MSW shares mocks between tests and the browser.
Con: stubs encode your assumptions — pair with contract tests for the ones you own.

Reach for it to isolate from third-party HTTP you can't run locally.

A four-line decision guide

  • Pure logic with clear invariants? Property-based (fast-check / Hypothesis / jqwik) before you write 20 example cases by hand.
  • Multiple services, separate teams? Contract tests (Pact) instead of brittle cross-service E2E.
  • Tests you don't trust on critical code? Mutation testing (Stryker / PIT / mutmut) on the diff to find the gaps.
  • A real dependency in the loop? Testcontainers or WireMock / MSW for fidelity; Playwright only for the thin top of the pyramid.

07 · Worked example · recap 5 min

Two real tests,
then the five things to remember.

A property test and a contract test, end to end — the two techniques most likely to change how your team works this quarter.

1 · A property test that found a real bug

    // PROPERTY: splitting a bill loses no cents
    // (split is the function under test, defined elsewhere)
    import fc from "fast-check"

    test("split(total, n) always sums back to total", () => {
      fc.assert(
        fc.property(
          fc.integer({ min: 0, max: 1e6 }), // cents
          fc.integer({ min: 1, max: 50 }), // people
          (total, n) => {
            const parts = split(total, n)
            const sum = parts.reduce((a, b) => a + b, 0)
            expect(sum).toBe(total) // invariant
            expect(parts).toHaveLength(n)
          }
        )
      )
    })
    // fails, then shrinks to a tiny case such as split(1, 3) → [0,0,0], sum 0 ≠ 1 ✕
[Figure: split(1, 3) returns 0, 0, 0; sum = 0, but total = 1: a cent vanished.]

Few hand-written example suites would try 1 ÷ 3. The shrinker hands you a minimal repro of a classic remainder bug.
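
One possible fix, as a sketch: hand the remainder out one cent at a time. This is an illustration rather than the deck's own implementation, and sumsBack is a hypothetical helper that checks the same property exhaustively over a small range instead of sampling it.

```typescript
// Split cents across people without losing any: the first `remainder`
// people each get one extra cent.
function split(totalCents: number, people: number): number[] {
  const base = Math.floor(totalCents / people);
  const remainder = totalCents % people;
  const parts: number[] = [];
  for (let i = 0; i < people; i++) parts.push(i < remainder ? base + 1 : base);
  return parts;
}

// The property from the test above, checked for every total and head count in range.
function sumsBack(maxTotal: number, maxPeople: number): boolean {
  for (let total = 0; total <= maxTotal; total++) {
    for (let n = 1; n <= maxPeople; n++) {
      const parts = split(total, n);
      const sum = parts.reduce((a, b) => a + b, 0);
      if (sum !== total || parts.length !== n) return false;
    }
  }
  return true;
}

console.log(split(1, 3)); // [1, 0, 0]
console.log(sumsBack(200, 12)); // true
```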

2 · A contract test, both sides

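Consumer publishes the pact

    // pact, integer and like come from the consumer test setup (PactV3, MatchersV3)
    pact
      .given("user 7 exists")
      .uponReceiving("GET /users/7")
      .withRequest({ method: "GET", path: "/users/7" })
      .willRespondWith({
        status: 200,
        body: { id: integer(7), email: like("a@b.co") },
      })
    // → publishes users-web.json to the broker

Provider verifies it

    import { Verifier } from "@pact-foundation/pact"

    await new Verifier({
      provider: "Users",
      providerBaseUrl: PROVIDER_URL, // assumption: where the Users service under test is running
      pactBrokerUrl: BROKER,
      stateHandlers: {
        "user 7 exists": () => db.seed({ id: 7 }),
      },
    }).verifyProvider()
    // fails the build if Users drops or renames a field the consumer uses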

1. State properties, generate inputs. Round-trip, invariant, idempotence, oracle, metamorphic; then let shrinking hand you the minimal bug.
2. Contract, don't integrate. Pin cross-service assumptions in a consumer-driven pact; gate releases with can-i-deploy.
3. Mutate to grade your tests. Coverage proves a line ran; mutation score proves a test would catch the break. Hunt survivors, ignore equivalents.
4. Engineer the data and the determinism. Builders for data, factories over shared fixtures, injected clocks and pinned seeds: kill flakiness at the source.
5. Own your boundaries. Fake interfaces you own; never mock what you don't; verify every fake against the real thing.