Zero-Downtime Migration of Amazon Business Checkout to Microservices Using the Strangler Fig Pattern
Migrating a system that serves millions of users a day is not a refactoring problem. It’s a risk management problem.
The Amazon Business Checkout page was one of those systems. A battle-tested Java monolith responsible for computing business logic — pricing rules, tax calculations, purchase approval workflows, entitlements — and rendering the entire checkout experience for enterprise customers. It had accumulated years of tightly coupled logic and an equally long list of reasons why you couldn’t just rewrite it. Every enterprise order on Amazon flowed through it. There was no maintenance window. There was no acceptable error rate.
The Strangler Fig pattern gave us the framework. The challenge was the execution — how do you extract business logic and rendering responsibilities from a live monolith into independent microservices without a single dropped request, without visible degradation, and without a cutover you can’t reverse?
This is how we did it.
Why the Strangler Fig, and Why It’s Harder Than It Looks
The Strangler Fig pattern, named after the fig that grows around a host tree until it can stand on its own, is the standard playbook for incrementally replacing a legacy system. Route traffic through a new facade, shift responsibility service by service, decommission once the old system is no longer load-bearing.
Simple in theory. The hard part is “incrementally.” In practice, you need a concrete answer to a question at every step: how do you know the new services are computing the right thing before you trust them with real traffic?
At the scale of Amazon Business Checkout, “wrong” doesn’t mean a flaky test. It means an enterprise customer sees an incorrect price, a purchase order approval fires incorrectly, or a tax calculation diverges across thousands of line items. The three phases we designed — shadow traffic with parity checking, cutover, and deprecation — were each an answer to that question. Not a clever architecture diagram. A trust-building exercise, one phase at a time.
Phase 1: Shadow Traffic and Parity Checking
Before routing any real traffic to the new microservices, we needed to prove they were correct — not in staging, not in unit tests, but under actual production load.
We did this by fanning out every incoming checkout request through a message queue. The monolith remained the synchronous, authoritative path: it owned the response the customer received. The new microservices consumed the same requests asynchronously from the queue, executed their business logic in the background, and produced their own computed results. Then we compared them.
```mermaid
flowchart LR
    A[Checkout Request] --> B[Request Router]
    B -->|Synchronous, owns response| C[Monolith]
    B -->|Async fan-out| D[Message Queue]
    D --> E[New Microservices]
    C -->|Authoritative result| F[Customer Response]
    E -->|Shadow result| G[Parity Checker]
    C -->|Expected result| G
    G -->|Mismatch| H[Alerts & Logs]
```
The parity checker ran continuously in the background, comparing every monolith output against the shadow result from the new services. Any divergence incremented a mismatch metric, was logged with both results, and raised an alert.
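```java
// checkout/ShadowRouter.java
public CheckoutResult handleRequest(CheckoutRequest request) {
    // Primary: monolith, synchronous, owns the customer-facing response
    CheckoutResult authoritative = monolith.compute(request);
    try {
        // Shadow: publish to queue for async processing by new services.
        // Assumption: the event also carries the authoritative result,
        // which the parity checker reads via getAuthoritativeResult().
        eventQueue.publish(CheckoutEvent.from(request, authoritative));
    } catch (RuntimeException e) {
        // A shadow-path failure must never fail the customer's request
        logger.warn("Shadow publish failed", e);
    }
    return authoritative;
}
```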
```java
// checkout/ParityChecker.java
import java.util.Objects;

@QueueConsumer
public void onShadowResult(ShadowResultEvent event) {
    CheckoutResult shadow = event.getShadowResult();
    CheckoutResult expected = event.getAuthoritativeResult();
    // Objects.equals tolerates a null result on either side
    if (!Objects.equals(shadow, expected)) {
        metrics.increment("checkout.parity.mismatch",
            Tag.of("service", event.getServiceName()));
        // Assumes an SLF4J-style logger with {} placeholders
        logger.warn("Parity mismatch requestId={} monolithResult={} shadowResult={}",
            event.getRequestId(), expected, shadow);
    }
}
```
The exit condition for this phase was unambiguous: zero mismatches sustained across a full observation window that included peak traffic. Not 99.9% parity. Zero mismatches. For a checkout system, even a tiny mismatch rate represents real enterprise customers receiving wrong pricing or broken approval flows.
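A gate like this can be written down as code so that "ready for cutover" is never a judgment call. The following is a minimal sketch, not the implementation used in this migration: `ParityGate`, its interval-based bookkeeping, and the `requiredIntervals` parameter are all hypothetical names chosen for illustration. It assumes monitoring reports one summary per interval.

```java
/**
 * Hypothetical exit gate for the shadow phase: it opens only after a run of
 * consecutive mismatch-free intervals that includes at least one peak interval.
 */
final class ParityGate {
    private final int requiredIntervals; // assumed length of the observation window
    private int cleanIntervals = 0;
    private boolean sawPeak = false;

    ParityGate(int requiredIntervals) {
        this.requiredIntervals = requiredIntervals;
    }

    /** Records one monitoring interval of shadow comparisons. */
    void record(long comparisons, long mismatches, boolean peakTraffic) {
        if (mismatches > 0) {
            // A single mismatch restarts the whole observation window.
            cleanIntervals = 0;
            sawPeak = false;
            return;
        }
        if (comparisons == 0) {
            // An interval with no comparisons is not evidence of parity.
            return;
        }
        cleanIntervals++;
        sawPeak = sawPeak || peakTraffic;
    }

    /** True only when the window is long enough, clean, and covered a peak. */
    boolean readyForCutover() {
        return cleanIntervals >= requiredIntervals && sawPeak;
    }
}
```

Resetting on any mismatch is the strict reading of "zero sustained": a clean week followed by one divergence starts the clock again.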
Early on we hit mismatches regularly — edge cases in B2B pricing logic that hadn’t surfaced in tests, subtle differences in how entitlement rules were evaluated under concurrent requests, tax jurisdiction handling that behaved differently at high throughput. Each mismatch was a bug report. We fixed them, watched the rate fall, and only moved forward once it had held at zero through several peak periods.
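Treating each mismatch as a bug report is easier when the report names the field that diverged. The sketch below is one way that could look, under stated assumptions: `ResultSnapshot` is a hypothetical, heavily simplified stand-in for the real checkout result, and the three fields are illustrative only. It also shows a well-known Java pitfall worth ruling out early: `BigDecimal.equals` treats `10.50` and `10.5` as different because their scales differ, while `compareTo` compares numeric value only.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical, simplified view of a checkout result. */
final class ResultSnapshot {
    final BigDecimal total;
    final BigDecimal tax;
    final String approvalState;

    ResultSnapshot(BigDecimal total, BigDecimal tax, String approvalState) {
        this.total = total;
        this.tax = tax;
        this.approvalState = approvalState;
    }
}

final class ParityDiff {
    /** Returns the names of the fields that differ; an empty list means parity. */
    static List<String> diff(ResultSnapshot expected, ResultSnapshot shadow) {
        List<String> mismatched = new ArrayList<>();
        if (!sameAmount(expected.total, shadow.total)) mismatched.add("total");
        if (!sameAmount(expected.tax, shadow.tax)) mismatched.add("tax");
        if (!Objects.equals(expected.approvalState, shadow.approvalState)) {
            mismatched.add("approvalState");
        }
        return mismatched;
    }

    /** Compares monetary amounts by numeric value, ignoring scale. */
    private static boolean sameAmount(BigDecimal a, BigDecimal b) {
        if (a == null || b == null) return a == b;
        return a.compareTo(b) == 0;
    }
}
```

Whether a scale-only difference should count as a mismatch is a policy decision for the team; this sketch assumes it should not.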
This phase was the most important one. It’s where the new services earned the right to be trusted.
Phase 2: The Cutover
With parity verified, the cutover itself was deliberately anticlimactic — and that’s exactly how it should be.
We flipped a feature flag to route primary compute traffic to the new microservices. The monolith remained fully deployed. If anything anomalous appeared, we could revert within seconds without a rollback deployment, without a war room scramble, and without any customer experiencing a degraded state.
```mermaid
flowchart LR
    A[Checkout Request] --> B[Request Router]
    B -->|Feature flag OFF| C[Monolith]
    B -->|Feature flag ON| D[New Microservices]
    C --> E[Customer Response]
    D --> E
```
```java
// checkout/ComputeRouter.java
public CheckoutResult computeCheckout(CheckoutRequest request) {
    if (featureFlags.isEnabled("checkout.microservices.primary", request.getCustomerId())) {
        return newCheckoutService.compute(request);
    }
    return monolith.compute(request);
}
```
The flag rollout was gradual — 1%, 5%, 25%, 50%, 100% — with a monitoring hold at each threshold. At every step we watched p50, p95, and p99 latencies for each service independently, error rates, and business-level metrics: checkout completion rate, order confirmation rate, pricing accuracy alerts. At every step, the signal was clean.
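A percentage rollout keyed on customer ID is commonly implemented with deterministic bucketing, so a given customer gets the same answer on every request and stays enabled as the percentage grows. The sketch below illustrates that idea only. `PercentageRollout` is a hypothetical class, not the flag system used here, and a production version would more likely use a stronger, salted hash than `String.hashCode`.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical percentage flag with sticky, deterministic bucketing. */
final class PercentageRollout {
    // Current rollout percentage; changing it at runtime needs no deployment.
    private final AtomicInteger percent = new AtomicInteger(0);

    void setPercent(int value) {
        if (value < 0 || value > 100) {
            throw new IllegalArgumentException("percent must be in 0..100");
        }
        percent.set(value);
    }

    /** A customer is enabled when their fixed bucket falls below the percentage. */
    boolean isEnabled(String customerId) {
        return bucketOf(customerId) < percent.get();
    }

    /** Maps a customer ID to a stable bucket in 0..99. */
    static int bucketOf(String customerId) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(customerId.hashCode(), 100);
    }
}
```

Because each bucket is fixed, moving from 5% to 25% only adds customers, and `setPercent(0)` sends everyone back to the monolith on their next request.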
The feature flag was not just a safety net. It was the entire philosophy of the cutover encoded in one boolean: every stage is reversible until you decide it isn’t.
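The monitoring hold at each threshold can be expressed as a small decision function: advance one stage if the signal is clean, otherwise go back to zero. This is a sketch under assumptions, not a description of the tooling used in this rollout. `RolloutGuard`, the nearest-rank percentile method, and the `maxP99Ms` and `maxErrorRate` thresholds are hypothetical, and the automatic revert is an illustrative choice; the stage list is the one described above.

```java
import java.util.Arrays;

/** Hypothetical guard evaluated at the end of each monitoring hold. */
final class RolloutGuard {
    // Rollout stages taken from the article: 1%, 5%, 25%, 50%, 100%.
    static final int[] STAGES = {1, 5, 25, 50, 100};

    /** Nearest-rank percentile of the samples; p is in the range (0, 100]. */
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Returns the next percentage: one stage up if healthy, otherwise 0. */
    static int nextPercent(int currentPercent, long[] latenciesMs, double errorRate,
                           long maxP99Ms, double maxErrorRate) {
        boolean healthy = percentile(latenciesMs, 99) <= maxP99Ms
                && errorRate <= maxErrorRate;
        if (!healthy) {
            return 0; // revert: all traffic goes back to the monolith
        }
        for (int stage : STAGES) {
            if (stage > currentPercent) {
                return stage;
            }
        }
        return 100; // already fully rolled out
    }
}
```

A real hold would also check the business-level metrics listed above, such as checkout completion rate; they are left out here to keep the sketch short.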
The unglamorous truth about a well-executed cutover is that it should feel like nothing happened. Ours did.
Phase 3: Deprecation
Once the new microservices had operated as the primary compute path through several full traffic cycles — including peak events — we turned off shadow traffic and began decommissioning the monolith.
```mermaid
flowchart LR
    A[Checkout Request] --> B[Request Router]
    B --> C[New Microservices]
    C --> D[Customer Response]
    E[Monolith]:::deprecated
    classDef deprecated fill:#f5f5f5,stroke:#ccc,color:#999
```
The temptation after a clean cutover is to clean up immediately. The right move is to wait. The monolith had been running for years and had institutional knowledge encoded in its behavior that no design document fully captured. Keeping it alive — observable but not in the critical path — is cheap insurance.
Deprecation happened in stages:
- Stop shadow traffic — disable the message queue fan-out. The new services are now the sole compute path.
- Drain the queue — ensure no in-flight events are still being processed in the background.
- Set the monolith to read-only mode — it can still serve as a reference for debugging, but can no longer affect customer-facing output.
- Decommission — archive the codebase, remove the routing layer, shut down the monolith instances.
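The first two stages depend on ordering: the fan-out has to stop before "drained" means anything, otherwise new events keep arriving. A minimal in-memory sketch of that ordering follows. `ShadowPipeline` is a hypothetical stand-in for the real message queue, whose API is not described in this article, and the counters replace whatever queue-depth and in-flight metrics the actual broker exposes.

```java
/** Hypothetical in-memory model of the shadow fan-out and its drain check. */
final class ShadowPipeline {
    private boolean fanOutEnabled = true;
    private int pending = 0; // events published but not yet fully processed

    /** Publishes one shadow event; returns false once fan-out is stopped. */
    synchronized boolean publish() {
        if (!fanOutEnabled) {
            return false;
        }
        pending++;
        return true;
    }

    /** Called by a consumer after it has finished processing one event. */
    synchronized void markProcessed() {
        pending--;
    }

    /** Stage 1: stop shadow traffic. */
    synchronized void stopFanOut() {
        fanOutEnabled = false;
    }

    /** Stage 2 check: drained only if fan-out is off and nothing is pending. */
    synchronized boolean isDrained() {
        return !fanOutEnabled && pending == 0;
    }

    /** Polls until drained or the timeout elapses; returns whether it drained. */
    boolean awaitDrained(long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!isDrained()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }
}
```

Only once `awaitDrained` returns true would it be safe to move on to the read-only and decommission stages.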
When the last monolith instance shut down, the migration was complete. A checkout system that had computed pricing, entitlements, taxes, and approval flows in a single Java monolith for years was now a set of independent, independently deployable, independently scalable microservices — and no customer had noticed a thing.
What I Learned
- The phases are not steps, they’re gates. Each phase has an explicit, measurable exit condition. The instinct to accelerate — to move from shadow traffic to cutover before parity is perfect — is exactly how migrations fail at scale.
- Shadow traffic is the most valuable thing you can build. You cannot trust a new service until you’ve proven it computes the same result as the old one under real production load. The time spent in Phase 1 is directly proportional to the confidence you have in Phase 2.
- Feature flags are not a deployment convenience. At Amazon’s scale, a flag that lets you shift 1% of traffic and immediately reverse is a fundamentally different risk profile than a binary deploy-or-rollback. It’s what makes a “cutover” actually zero-downtime.
- Async fan-out lets you run in shadow for as long as you need. The message queue pattern means the new services process every production request for weeks without any customer being exposed to their output. By the time you cut over, the services have seen more real traffic than any load test could generate.
- Decommissioning is a phase, not an afterthought. The migration isn’t done when the new system is live. It’s done when the old one is gone. Keep the monolith as insurance for a defined period, but letting shadow traffic and the old system run past that point is technical debt with a real operational cost.
Migrations like this one don’t succeed because of clever architecture. They succeed because of discipline — the willingness to move slowly, measure everything, and treat each phase as a commitment that requires evidence before you advance.
If you’re working through a similar migration or want to dig into any of the tradeoffs, connect with me on GitHub or LinkedIn.