From Legacy to Live: How a Guerrilla Team Migrated a Monolith to Microservices Without Losing a Single User

Every team that inherits a monolith dreams of a clean break. But the nightmare is real: a migration that drags on for months, breaks user sessions, loses data, or forces a full rollback on launch day. This guide tells the story of a guerrilla team—small, cross-functional, and autonomous—that moved a monolith to microservices without losing a single user. We'll show you exactly how they did it, what they prioritized, and where most teams stumble.

Who Needs This and What Goes Wrong Without It

If your team is responsible for a monolith that's slowing down deployments, making scaling expensive, or causing cascading failures, you're the audience. This guide is for engineers, tech leads, and product managers who need a repeatable strategy, not a textbook architecture diagram.

Without a structured migration approach, teams often fall into three traps. The first is the big-bang rewrite: building a new system from scratch and flipping a switch. This almost always fails because the new system misses edge cases the monolith handled for years. The second trap is analysis paralysis—spending months drawing service boundaries without writing a line of production code. The third is the 'microservices in name only' anti-pattern: splitting code into separate repos but keeping a shared database and synchronous calls, which just recreates the monolith with network latency.

What's at stake is user trust. A migration that causes downtime, lost orders, or corrupted profiles erodes confidence faster than any performance gain can restore. The guerrilla team we followed avoided all these pitfalls by focusing on incremental extraction with continuous delivery.

Why Most Migrations Stall

Migrations commonly stall for three reasons: unclear ownership of shared data, no feature flags to toggle traffic, and too little observability to compare the old and new systems side by side. Clear data ownership, feature flags, and observability are the three pillars; without them, teams move either too slowly or too recklessly.

Prerequisites and Context to Settle First

Before touching a line of microservice code, the team established a few non-negotiable prerequisites. First, they instrumented the monolith with structured logging and distributed tracing (using OpenTelemetry). Without this, you cannot compare behavior between old and new paths. Second, they introduced feature flags via a simple in-house toggle service. Every extracted feature would be guarded by a flag that could route individual users or a percentage of traffic to the new service.
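A minimal sketch of how such a flag might decide who goes to the new service, assuming a hash-based bucketing scheme that the team's toggle service may or may not have used. The function names, the `allowlist` parameter, and the 10,000-bucket granularity are hypothetical.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in [0, 10000) for the given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 10_000


def routes_to_new_service(user_id: str, flag: str, percent: float,
                          allowlist: frozenset = frozenset()) -> bool:
    """Return True if this user should be served by the new service."""
    # Individually targeted users always get the new path.
    if user_id in allowlist:
        return True
    # percent is 0..100; each bucket represents 0.01% of users.
    return bucket(user_id, flag) < percent * 100
```

Because the bucket depends only on the flag name and the user ID, a user who lands in the 5% cohort stays in the cohort when the percentage grows, so nobody flips back and forth between backends during a ramp.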

Third, they agreed on a data ownership principle: each microservice owns its data, but during migration, the monolith remains the source of truth. This means the new service writes to its own database but also writes back to the monolith's database until the migration is complete. This dual-write strategy ensures that rollback is always possible without data loss.
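The write path described above could look roughly like the following sketch. It assumes both stores behave like dictionaries and that the monolith write happens first because the monolith is the source of truth; the `DualWriter` class and its `needs_repair` list are hypothetical names, not the team's code.

```python
class DualWriter:
    """Write each record to the monolith store first, then to the service store."""

    def __init__(self, monolith_store, service_store):
        self.monolith_store = monolith_store
        self.service_store = service_store
        self.needs_repair = []  # keys whose service-side write failed

    def save(self, key, record):
        # The monolith is the source of truth, so a failure here propagates
        # and nothing is written to the service store.
        self.monolith_store[key] = dict(record)
        try:
            self.service_store[key] = dict(record)
        except Exception:
            # Leave the gap for a reconciliation job instead of failing the request.
            self.needs_repair.append(key)
```

Writing the source of truth first means the worst case is a stale copy in the new service, which reconciliation can repair, rather than a record that exists only in the system you might roll back from.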

Team Composition and Communication

The guerrilla team consisted of four engineers: two backend, one frontend, one DevOps. They had a dedicated product owner who shielded them from feature requests during the migration. Their communication cadence was a daily 15-minute standup focused on blockers, not status updates. They used a shared dashboard showing traffic percentages, error rates, and latency for both the monolith and each extracted service.

Tooling Stack

They chose tools they already knew: Kubernetes for orchestration, gRPC for inter-service communication, and PostgreSQL for persistence. No exotic databases or message brokers were introduced during the migration—they deferred that complexity. The key was to minimize unfamiliar moving parts while learning the microservices workflow.

Core Workflow: The Strangler Fig in Action

The migration followed the strangler fig pattern: incrementally replace parts of the monolith with new services, routing traffic through a reverse proxy (nginx) that decides which backend handles each request. The workflow had four steps repeated for each microservice.
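The routing decision at the heart of the pattern can be sketched in a few lines. The team did this in nginx; the Python below only illustrates the logic, and the `routes` table, backend names, and longest-prefix rule are assumptions.

```python
MONOLITH = "monolith"


def choose_backend(path: str, routes: dict) -> str:
    """Return the backend for the longest matching path prefix, else the monolith."""
    matches = [
        prefix for prefix in routes
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        # Anything not yet extracted keeps going to the monolith.
        return MONOLITH
    return routes[max(matches, key=len)]
```

For example, with `routes = {"/profiles": "profile-service"}`, a request for `/profiles/42` goes to the new service while `/orders/1` still reaches the monolith. Each extraction adds one entry to the table.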

Step 1: Identify a Seam

Look for a domain boundary that has clear inputs and outputs, minimal shared state, and a team that understands it. The guerrilla team started with the user profile service—it had a single database table, a REST API, and no real-time dependencies. They drew a bounded context around it and defined the contract (API and data schema) for the new service.

Step 2: Extract and Dual-Write

They built the new service to handle read and write requests, but initially all traffic still went to the monolith. The new service listened to a message queue (Kafka) for profile change events from the monolith and wrote them to its own database. This dual-write phase ran for two weeks while the team validated data consistency with automated reconciliation jobs.
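A reconciliation job of this kind boils down to comparing the two datasets key by key. The sketch below assumes both sides have been loaded into dictionaries keyed by record ID; the function name and report shape are hypothetical.

```python
def reconcile(monolith_rows: dict, service_rows: dict) -> dict:
    """Compare two keyed datasets and report where the new service diverges."""
    missing = sorted(k for k in monolith_rows if k not in service_rows)
    unexpected = sorted(k for k in service_rows if k not in monolith_rows)
    mismatched = sorted(
        k for k in monolith_rows
        if k in service_rows and monolith_rows[k] != service_rows[k]
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}
```

An empty report across several consecutive runs is a reasonable signal that the new database is ready to serve reads.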

Step 3: Shift Traffic Gradually

Using feature flags, they routed 1% of profile read requests to the new service. They monitored error rates, latency, and data freshness. After a day with zero errors, they increased to 5%, then 20%, then 50%, and finally 100%. Each increase required a sign-off from the on-call engineer. Writes were switched only after reads were stable at 100%.
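The ramp rule can be written down as a small gate. This is a sketch under assumptions: the step list takes the percentages named above and treats 100% as the last step, and the function name and parameters are hypothetical.

```python
RAMP_STEPS = (1, 5, 20, 50, 100)


def next_percentage(current: int, errors_observed: int, signed_off: bool) -> int:
    """Return the next traffic percentage, or the current one if the gate fails."""
    # Hold the current level unless the window was clean and on-call approved.
    if errors_observed > 0 or not signed_off:
        return current
    for step in RAMP_STEPS:
        if step > current:
            return step
    return current  # already at full traffic
```

Encoding the gate in code keeps the ramp from depending on someone remembering the rule during a busy week.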

Step 4: Remove the Old Code

Once the new service handled all traffic for the domain, they deleted the corresponding code from the monolith. This step is often skipped, leading to dead code and confusion. The team made it a rule: no old code stays after a service is fully extracted.

Tools, Setup, and Environment Realities

You don't need a fancy service mesh to start. The guerrilla team used nginx with Lua scripting for traffic splitting, a simple feature flag service built on Redis, and a cron job for data reconciliation. Their CI/CD pipeline automatically deployed the monolith and microservices separately, with canary analysis that compared error budgets.

Observability Stack

They used Prometheus for metrics, Grafana for dashboards, and a custom script that compared the monolith's response to the microservice's response for the same request. This 'diff' dashboard caught subtle bugs like different date formatting or missing fields.
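The team's comparison script is not shown in the source, so the following is only a sketch of the idea: walk two JSON-like responses and list every field that is missing, extra, or different. The function name and message format are hypothetical.

```python
def diff_responses(old: dict, new: dict, prefix: str = "") -> list:
    """List field-level differences between the monolith and service responses."""
    diffs = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}.{key}" if prefix else str(key)
        if key not in new:
            diffs.append(f"missing in new: {path}")
        elif key not in old:
            diffs.append(f"extra in new: {path}")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Recurse so nested objects are compared field by field.
            diffs.extend(diff_responses(old[key], new[key], path))
        elif old[key] != new[key]:
            diffs.append(f"changed: {path}: {old[key]!r} -> {new[key]!r}")
    return diffs
```

A date returned as `2024-01-05` by one side and `05/01/2024` by the other shows up as a `changed` entry, and a dropped nested field shows up as `missing in new`, which are the two kinds of bug mentioned above.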

Database Migration Strategy

Dual-writes are the hardest part. The team streamed changes from PostgreSQL's write-ahead log (WAL) to the new service's database. This avoided application-level dual-writes in the monolith, which would have required changes to the old codebase. The new service subscribed to the WAL and applied each change to its own schema, which was slightly different (normalized tables instead of a single profile table).
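Applying a change from a flat profile row to normalized tables might look like the sketch below. Everything about the data shape is an assumption: the source does not give the schema, so the `op` and `row` keys, the column names, and the two target tables are hypothetical, and plain dictionaries stand in for database tables.

```python
def apply_change(change: dict, users: dict, addresses: dict) -> None:
    """Apply one flat profile-row change to two normalized in-memory tables."""
    row = change["row"]
    user_id = row["id"]
    if change["op"] == "delete":
        users.pop(user_id, None)
        addresses.pop(user_id, None)
        return
    # Inserts and updates are both applied as upserts, so replaying
    # the same change twice leaves the tables unchanged.
    users[user_id] = {"name": row["name"], "email": row["email"]}
    addresses[user_id] = {"city": row["city"], "country": row["country"]}
```

Treating every change as an upsert is a deliberate choice here: change streams are often replayed after a restart, and an idempotent apply step makes replays safe.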

Environment Parity

They maintained three environments: development, staging (with synthetic traffic), and production. Staging ran the same Kubernetes configuration as production, but with a fraction of the data. They replayed production traffic from a week ago to stress-test the new service before any user saw it.

Variations for Different Constraints

Not every team has the luxury of a dedicated guerrilla squad or a Kafka cluster. Here are variations for common constraints.

Small Team, No Dedicated DevOps

If you're a three-person startup, skip Kubernetes. Use a single server with Docker Compose and a reverse proxy like Caddy that can do path-based routing. Extract services one at a time, and keep the database shared until you have the resources to split it. The key is to still use feature flags and canary releases, even if the canary is just a separate container on the same machine.

High Compliance Requirements

If your monolith handles financial or health data, you need audit logs and data retention policies in place before migrating. Use database triggers to log all changes during dual-write, and run reconciliation reports daily. A guerrilla team working in a regulated environment should also add a manual approval step before each traffic increase.

Monolith with No Test Coverage

If your monolith lacks tests, don't start by writing unit tests for old code; that effort is wasted on code you are about to delete. Instead, write contract tests for the API endpoints you're extracting, and use record-and-replay tools (like VCR) to capture real interactions. The new service should be tested against recorded traffic from production, not only against synthetic tests.
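A contract test can be as small as a check that a response carries the agreed fields with the agreed types. The sketch below assumes a contract expressed as a mapping from field name to Python type; `PROFILE_CONTRACT` and its fields are hypothetical, not the team's actual schema.

```python
# Hypothetical contract for the profile endpoint.
PROFILE_CONTRACT = {"id": int, "email": str, "display_name": str}


def contract_violations(response: dict, contract: dict) -> list:
    """Return the fields that are missing or have the wrong type."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"{field}: missing")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems
```

Running the same check against recorded monolith responses and against the new service gives you one test suite that both implementations must pass.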

Time-Boxed Migration

If leadership gives you three months, prioritize the most painful service first—usually the one with the highest change frequency or slowest deployment. The guerrilla team used a 'value vs. risk' matrix: extract high-value, low-risk services first to build momentum, then tackle the hard ones with more experience.

Pitfalls, Debugging, and What to Check When It Fails

Even with careful planning, things break. Here are the most common failure modes and how to catch them early.

Data Inconsistency During Dual-Write

The most frequent issue is that the new service doesn't see all updates because of replication lag or race conditions. For example, a user updates their profile while the migration is in progress, and the change is written to the monolith but not yet replicated to the new service. To catch this, run a reconciliation job every hour that compares a random sample of records. If discrepancies exceed 0.01% of the sampled records, pause the migration and fix the replication lag.
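The sampling check and the pause rule can be sketched as follows, assuming both datasets are available as dictionaries keyed by record ID. The function names and the injected random generator are hypothetical; the 0.01% threshold comes from the text.

```python
import random

_MISSING = object()  # sentinel so a stored None is not mistaken for a gap


def sample_discrepancy_rate(source: dict, replica: dict,
                            sample_size: int, rng: random.Random) -> float:
    """Return the fraction of sampled source records that differ in the replica."""
    keys = list(source)
    if not keys:
        return 0.0
    sample = rng.sample(keys, min(sample_size, len(keys)))
    bad = sum(1 for k in sample if replica.get(k, _MISSING) != source[k])
    return bad / len(sample)


def should_pause(rate: float, threshold: float = 0.0001) -> bool:
    """Pause the migration when the rate exceeds 0.01% (0.0001 as a fraction)."""
    return rate > threshold
```

Note that a 0.01% threshold only means something with a large sample: with 1,000 sampled records, a single mismatch is already 0.1% and trips the pause.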

Latency Spikes from Cross-Service Calls

When you extract a service, the monolith now needs to make an HTTP or gRPC call to get data it previously accessed locally. This can increase p99 latency by 10-50ms. Mitigate by adding caching (Redis) in the monolith for the extracted data, and set a timeout on the call so that if the microservice is slow, the monolith falls back to its own database.
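The cache, timeout, and fallback combination might be wired together like this. It is a sketch with stand-ins: a dictionary plays the role of Redis, `fetch_remote` and `fetch_local` are hypothetical callables, and the remote callable is assumed to raise `TimeoutError` when it exceeds its deadline (real HTTP and gRPC clients raise their own exception types).

```python
def get_profile(user_id, cache: dict, fetch_remote, fetch_local,
                timeout_s: float = 0.05):
    """Read a profile via cache, then the microservice, then the local database."""
    if user_id in cache:
        return cache[user_id]
    try:
        profile = fetch_remote(user_id, timeout_s)
    except TimeoutError:
        # The microservice is slow: answer from the monolith's own database
        # and skip caching so the next request retries the service.
        return fetch_local(user_id)
    cache[user_id] = profile
    return profile
```

The sketch omits cache expiry. In practice the cached entries need a TTL or invalidation on write, otherwise the monolith can serve stale profiles long after an update.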

Feature Flag Hell

Too many flags become unmanageable. The guerrilla team limited each service extraction to a single flag that controlled routing for all endpoints of that domain. They also set a maximum flag lifetime: 30 days after full cutover, the flag was removed. This prevented permanent toggles that accumulate technical debt.
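The 30-day rule is easy to enforce with a small check that could run in CI or a scheduled job. This sketch assumes cutover dates are recorded per flag; the function name and input shape are hypothetical.

```python
from datetime import date, timedelta

MAX_FLAG_LIFETIME = timedelta(days=30)


def expired_flags(cutover_dates: dict, today: date) -> list:
    """Return flags that have outlived the 30-day window after full cutover."""
    return sorted(
        name for name, cutover in cutover_dates.items()
        if today - cutover > MAX_FLAG_LIFETIME
    )
```

Failing the build when this list is non-empty turns flag cleanup from a good intention into a requirement.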

Rollback Plan

Every traffic increase should have a one-click rollback to the previous state. The team used a script that updated the nginx config to send all traffic back to the monolith. They practiced rollbacks weekly during the migration so that muscle memory was ready when a real incident occurred.

When things do fail, the first step is to check the diff dashboard. If the new service returns a different response than the monolith for the same request, you have a logic bug, not an infrastructure issue. If the new service is slower, look at connection pools and database query plans. If data is missing, check the WAL replication lag and the reconciliation logs.

After the migration, the guerrilla team held a retrospective that produced a checklist for the next service: (1) write contract tests, (2) set up diff dashboard, (3) run reconciliation every hour, (4) practice rollback, (5) define success criteria (error rate < 0.1%, latency within 10% of monolith). They repeated this process for six services over six months, never losing a single user session or order.
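The success criteria in item (5) translate directly into a check. The sketch reads "within 10% of monolith" as "no more than 10% slower", which is an interpretation, and the function and parameter names are hypothetical.

```python
def meets_success_criteria(error_rate: float, service_latency_ms: float,
                           monolith_latency_ms: float) -> bool:
    """Check the checklist thresholds: error rate < 0.1%, latency within 10%."""
    error_ok = error_rate < 0.001  # 0.1% expressed as a fraction
    latency_ok = service_latency_ms <= monolith_latency_ms * 1.10
    return error_ok and latency_ok
```

Agreeing on these numbers before the extraction starts keeps the cutover decision from turning into a debate while traffic is already shifting.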
