The Architecture Decisions That Will Haunt You (And How to Make Them Well)

Series: The Modern SDLC · Post 2 of 17

Architecture decisions are the ones that age worst. A poorly written function is easy to fix. A naming convention that nobody likes is annoying but changeable. An architectural decision that turns out to be wrong — the wrong deployment topology, the wrong data strategy, the wrong communication pattern between services — is expensive, slow, and sometimes impossible to undo without rebuilding significant chunks of the system.

The frustrating part is that most architectural mistakes don't announce themselves. They feel like the right call at the time. They only reveal themselves six months later, when the system is under load, when the team has grown, or when the requirements have shifted in a way nobody anticipated.

This post won't tell you which architecture is right for your system. Nobody can tell you that without knowing your context. What it will do is give you the framework for making architectural decisions well — and help you recognise the patterns that consistently lead teams astray.

The one thing to remember

Every architectural decision is a trade-off. Your job is to understand what you're trading, not to find the option with no downsides.

The team that picks microservices because it's modern is making the wrong decision. The team that picks microservices because they have six independent teams deploying at different cadences and the blast radius of a shared deployment is costing them too much — that team might be making the right decision. Same architecture, completely different reasoning. Reasoning is what makes it right or wrong.

Document decisions as you make them: Architecture Decision Records

Before anything else — establish the habit of writing Architecture Decision Records.

An ADR is a short document, half a page to a page, that captures four things: the context (what situation prompted this decision), the decision (what you chose), the rationale (why you chose it over the alternatives), and the consequences (what becomes easier and what becomes harder as a result).

Store them in /docs/adr/ in your repository, numbered sequentially (0001-use-postgres-as-primary-database.md). Review them in the same pull request as the work they govern.
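As a minimal sketch of that convention, the helper below picks the next sequential filename and fills in a four-section template. The function names, the template wording, and the "Use Redis for sessions" title are illustrative assumptions, not an established ADR tool.

```python
import re

# Hypothetical template covering the four parts of an ADR described above.
ADR_TEMPLATE = """\
# {number:04d}. {title}

## Context
What situation prompted this decision?

## Decision
What did we choose?

## Rationale
Why this option over the alternatives?

## Consequences
What becomes easier, and what becomes harder?
"""


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def next_adr(existing_filenames: list[str], title: str) -> tuple[str, str]:
    """Return (filename, body) for the next sequentially numbered ADR."""
    numbers = [
        int(match.group(1))
        for name in existing_filenames
        if (match := re.match(r"(\d{4})-", name))
    ]
    number = max(numbers, default=0) + 1
    filename = f"{number:04d}-{slugify(title)}.md"
    return filename, ADR_TEMPLATE.format(number=number, title=title)


filename, body = next_adr(
    ["0001-use-postgres-as-primary-database.md"], "Use Redis for sessions"
)
print(filename)
```

In practice you would pass in the directory listing of /docs/adr/ and write the returned body to the returned filename.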

The value compounds over time. Future engineers stop relitigating decisions that were already made carefully. Onboarding is faster because the reasoning is written down rather than living in one person's head. And when the context changes — when the team grows, when requirements shift, when the original rationale no longer holds — the ADR tells you exactly why the decision was made, which tells you whether it still makes sense.

If you adopt only one practice from this post, make it this one.

The most important decision: monolith vs microservices

This is the single most consequential architectural decision you'll make early in a project, and the one most commonly made for the wrong reasons.

The case for starting with a monolith is stronger than most teams realise. A well-structured monolith is simpler to develop, simpler to test, simpler to deploy, simpler to debug, and dramatically cheaper to operate than a distributed system. Cross-cutting concerns — authentication, logging, transactions, database access — are straightforward. You don't need a service mesh, a distributed tracing system, or a contract testing suite. Refactoring across domain boundaries is a function call, not an API contract negotiation.

The case for microservices is real, but it's specific. Independent deployability — the ability for one team to ship without coordinating with another — is genuinely valuable at scale. Independent scalability matters when different parts of your system have dramatically different load profiles. Fault isolation matters when a failure in one area must not cascade to another.

The problem is that most teams adopting microservices early don't have the scale that makes those benefits real. What they get instead is a distributed monolith — services that are physically separate but logically coupled, sharing a database or calling each other synchronously in chains — with all the complexity of a distributed system and none of the benefits.

A widely recommended starting point is this: begin with a modular monolith. Enforce clean boundaries between domains inside the monolith, with separate modules, clear interfaces, and no reaching across boundaries directly. Then extract services at the points where you have a specific, nameable pain that a service boundary would solve. "We want to be cloud-native" is not a specific pain. "Our payments team can't deploy without a full regression test of the order system" is.

The modular monolith is the best of both worlds early on: the simplicity of a monolith with the internal discipline that makes future extraction possible.
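Internal discipline is easier to keep when something checks it. The sketch below is one possible way to do that, assuming a convention where each domain package exposes a public `api` module and everything else is private. The `DOMAINS` set, the `api` convention, and the function name are hypothetical, chosen only for illustration.

```python
import ast

# Hypothetical domain packages inside the monolith.
DOMAINS = {"orders", "payments"}


def boundary_violations(importing_module: str, source: str) -> list[str]:
    """List imports in `source` that reach into another domain's internals.

    Assumed rule: code may import another domain only through `<domain>.api`.
    """
    own_domain = importing_module.split(".")[0]
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        for target in targets:
            parts = target.split(".")
            crosses_domain = parts[0] in DOMAINS and parts[0] != own_domain
            if crosses_domain and parts[1:2] != ["api"]:
                violations.append(target)
    return violations


source = (
    "from payments.api import charge\n"      # allowed: public interface
    "from payments.models import Ledger\n"   # violation: another domain's internals
    "import orders.models\n"                 # allowed: same domain
)
print(boundary_violations("orders.service", source))
```

A check like this could run in CI over every module, so a boundary violation fails the build instead of surfacing during a later extraction.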

Synchronous vs asynchronous communication

Once you have more than one service — or more than one significant domain within your monolith — you face the question of how they talk to each other.

Synchronous communication (REST, gRPC) is simple. Service A calls service B and waits for a response. The request and response are tightly coupled in time. This is easy to understand, easy to debug, and easy to trace. It's also fragile: if service B is slow or down, service A is slow or down.

Asynchronous communication (message queues, event streams) decouples the sender and receiver in time. Service A publishes an event. Service B processes it when it's ready. This is more resilient — service B can be down and the messages queue up. It's also higher throughput — service A doesn't wait. But it's harder to understand, harder to debug, and requires explicit handling of ordering, idempotency, and eventual consistency.

The practical approach most mature teams converge on: use synchronous communication for queries (read something and respond immediately) and for commands where the caller needs an immediate result (create an order and confirm it was created). Use asynchronous communication for side effects — things that should happen as a consequence of an event but don't need to be synchronous with it (send an email when an order is placed, update a search index when a product changes, notify a downstream system when a payment clears).
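To make the split concrete, here is a small in-process sketch: the command returns its result to the caller immediately, while the email side effect is queued and handled later. The `EventBus` class, the `order_placed` event name, and the handler are assumptions for illustration, and a `deque` stands in for a real message broker.

```python
from collections import deque


class EventBus:
    """In-memory stand-in for a message queue."""

    def __init__(self):
        self._queue = deque()
        self._handlers = {}

    def subscribe(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, payload):
        # The publisher only enqueues; it never waits for consumers.
        self._queue.append((event_type, payload))

    def drain(self):
        """Deliver queued events; plays the role of a consumer process."""
        delivered = 0
        while self._queue:
            event_type, payload = self._queue.popleft()
            for handler in self._handlers.get(event_type, []):
                handler(payload)
            delivered += 1
        return delivered


class EmailOnOrderPlaced:
    """Idempotent handler: a redelivered event does not send a second email."""

    def __init__(self):
        self.sent = []
        self._seen = set()

    def __call__(self, payload):
        if payload["order_id"] in self._seen:
            return
        self._seen.add(payload["order_id"])
        self.sent.append(payload["email"])


def place_order(bus, orders, order_id, email):
    """Synchronous command: the caller gets the confirmation right away."""
    orders[order_id] = {"email": email, "status": "confirmed"}
    bus.publish("order_placed", {"order_id": order_id, "email": email})
    return orders[order_id]["status"]
```

The order is confirmed before any email is sent, and the handler tolerates duplicates because queues commonly redeliver.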

The mistake to avoid is reaching for distributed transactions across services, with two-phase commit as the usual culprit. If service A and service B both need to update their state atomically, you either share a database (and lose service independence) or you end up with complex distributed transaction protocols that are notoriously hard to get right. The better patterns accept eventual consistency. The saga pattern splits the work into local transactions and compensates explicitly when a later step fails. The outbox pattern writes the event to an outbox table in the same local transaction as the state change, so it can be published reliably afterwards.

Data strategy

Data is where architectural decisions get irreversible fastest. Schema changes are hard. Data migrations at production scale are slow and risky. Data that gets spread across the wrong stores at the start of a project tends to stay there.

A few decisions worth getting right early:

Database per service vs shared database. A shared database is simpler to query — cross-domain joins work, reporting is straightforward, transactions are easy. But it couples services at the data layer, meaning a schema change in one service can break another. Database per service gives you true independence but requires explicit data contracts between services and careful handling of eventual consistency. For a monolith or early-stage system, a shared database is fine. When you start extracting services, extract the data with them.

Choosing the right database type for each workload. Most modern systems use several database types. The question isn't "SQL or NoSQL" — it's which workload fits which engine. Relational databases (Postgres is the default recommendation) for transactional data, complex queries, strong consistency, and audit trails. Document stores (MongoDB) for flexible schemas and nested data. Key-value stores (Redis) for caching, sessions, and rate limiting. Time-series databases (TimescaleDB) for metrics and events. Don't put everything in one database because it's simpler — put each workload where it fits.

Migrations as code. Database schema changes should be version-controlled migration scripts, run automatically as part of deployment. Never change a production schema manually. Every migration should be reviewable, testable, and — where possible — backward-compatible with the previous application version. Backward-compatible migrations are what make zero-downtime deployments possible.
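Here is a minimal sketch of what "migrations as code" can look like, with SQLite standing in for your real database. The `schema_migrations` table name, the `MIGRATIONS` list, and the example `users` schema are assumptions; established migration tools do the same job with more safeguards.

```python
import sqlite3

# Ordered, version-controlled migrations (hypothetical example schema).
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    # Additive and nullable, so the previous application version keeps working.
    (2, "ALTER TABLE users ADD COLUMN display_name TEXT"),
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply migrations that have not run yet and return their versions."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    newly_applied = []
    for version, sql in sorted(migrations):
        if version in applied:
            continue
        conn.execute(sql)
        with conn:  # record the version once the statement has succeeded
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)", (version,)
            )
        newly_applied.append(version)
    return newly_applied


conn = sqlite3.connect(":memory:")
print(migrate(conn))  # first run applies everything
print(migrate(conn))  # second run finds nothing left to do
```

Migration 2 is the backward-compatible kind: code written against version 1 can still insert a user without knowing `display_name` exists.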

The CAP theorem in practice

If you're building a distributed system, you will eventually face a situation where the CAP theorem becomes relevant. It's worth understanding before that moment rather than during it.

CAP is usually summarised as: a distributed system can guarantee only two of three properties. These are Consistency (every read returns the most recent write), Availability (every request to a working node receives a non-error response), and Partition tolerance (the system continues to operate when network partitions occur). Since network partitions are a reality in any distributed system, you're effectively choosing between consistency and availability when things go wrong.

This isn't an abstract academic concern. It translates directly to design decisions:

For a banking system, you choose consistency. A user who transfers money must see the updated balance immediately, even if that means refusing requests during a network issue rather than returning stale data. Returning stale data in a financial context is worse than returning an error.

For a social media feed, you choose availability. A user seeing a feed that's a few seconds out of date is acceptable. A user unable to load their feed because the system refused the request for consistency reasons is not.

The practical implication: think through CAP for each type of data in your system, not for the system as a whole. Your user profile data probably needs consistency. Your activity feed can tolerate eventual consistency. Model this explicitly rather than applying one strategy uniformly.
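One way to model this explicitly is a per-data-type policy that a replica consults when it is cut off from its peers. The toy below is only a sketch: the `POLICY` mapping, the `Replica` class, and the `Unavailable` error are hypothetical names, and a real store would implement this inside its replication layer.

```python
from enum import Enum


class OnPartition(Enum):
    REFUSE = "refuse"            # favour consistency: error rather than stale data
    SERVE_STALE = "serve_stale"  # favour availability: answer with what we have


# Hypothetical policy, decided per type of data rather than per system.
POLICY = {
    "user_profile": OnPartition.REFUSE,
    "activity_feed": OnPartition.SERVE_STALE,
}


class Unavailable(Exception):
    """Raised when a read is refused to protect consistency."""


class Replica:
    def __init__(self):
        self.store = {}
        self.partitioned = False  # True while this replica cannot reach its peers

    def read(self, data_type, key):
        if self.partitioned and POLICY[data_type] is OnPartition.REFUSE:
            raise Unavailable(f"{data_type} cannot be read during a partition")
        return self.store.get((data_type, key))
```

During a partition the feed read still returns its possibly stale value, while the profile read fails fast and leaves the caller to retry or show an error.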

Non-functional requirements belong in the architecture, not the backlog

The most consistently underspecified aspect of early architecture work is non-functional requirements — the performance, reliability, scalability, and security requirements that constrain what solutions are possible.

These questions should be answered before the architecture is designed, not discovered after it's built:

What is the acceptable latency for the most common user-facing request? What load does the system need to handle at peak, and how far above current estimates should you design for? What is the availability target: 99.9% (about 43 minutes of downtime in a 30-day month) or 99.99% (about 4 minutes)? What are the data retention requirements? What regulatory frameworks apply?
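The downtime figures follow from simple arithmetic, sketched here assuming a 30-day month. The function name is illustrative.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per month")
```

Each extra nine cuts the budget by a factor of ten, which is why the target belongs in the architecture discussion and not in a later tuning pass.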

Non-functional requirements that are discovered after the architecture is designed are expensive. A system designed for 100 requests per second that needs to handle 10,000 requires architectural changes, not optimisation. A system designed without data residency requirements that turns out to need GDPR compliance in the EU needs redesign of its data model and potentially its infrastructure.

Write the non-functional requirements down. Document them in the ADR for each major architectural decision. "This service is designed for a p95 latency of under 200ms at 1,000 requests per second" is a testable, reviewable constraint. "This service should be fast" is not.
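A constraint phrased that way can be checked mechanically. The sketch below assumes you already have latency samples from a load test and uses the nearest-rank definition of a percentile, which is one of several conventions. The function names and the default 200ms budget are illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_budget(latencies_ms, budget_ms=200.0, pct=95.0):
    """True when the chosen percentile is under the budget."""
    return percentile(latencies_ms, pct) < budget_ms


# Hypothetical load-test result: 90 fast requests and 10 slow ones.
samples = [100.0] * 90 + [300.0] * 10
print(percentile(samples, 95), meets_latency_budget(samples))
```

With 10% of requests at 300ms, the p95 lands on a slow request and the budget check fails, even though the average looks healthy.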

Cloud-native vs lift-and-shift

If your system runs in the cloud — and most modern systems do — you'll face a spectrum of choices about how cloud-native to be.

Fully cloud-native means using managed services wherever possible: managed databases rather than self-hosted, serverless functions for event-triggered workloads, managed queues rather than self-hosted Kafka, managed Kubernetes rather than running your own control plane. The benefits are real: you don't operate the infrastructure, you get automatic scaling, you pay for what you use, and you're up and running faster.

The cost is vendor dependency. The more tightly your architecture is coupled to a specific cloud provider's services, the harder it is to move if pricing changes, service quality degrades, or strategic reasons require a different provider.

The pragmatic approach: use managed services freely, but abstract them behind thin interfaces in your own code. Your business logic shouldn't import AWS SDK directly — it should call a NotificationService interface that happens to be backed by SNS today and could be backed by something else tomorrow. You won't always take advantage of that abstraction, but when you need it, you have one seam to cut rather than two hundred call sites.
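As a sketch of that seam: the business logic depends only on a small interface, and an in-memory implementation stands in for the provider-backed one. The `notify` method signature and the `confirm_order` function are assumptions for illustration. An SNS-backed class would implement the same method and is deliberately left out here.

```python
from dataclasses import dataclass, field
from typing import Protocol


class NotificationService(Protocol):
    """The only notification surface the business logic knows about."""

    def notify(self, recipient: str, message: str) -> None: ...


@dataclass
class InMemoryNotifications:
    """Test double; a provider-backed class would expose the same method."""

    sent: list[tuple[str, str]] = field(default_factory=list)

    def notify(self, recipient: str, message: str) -> None:
        self.sent.append((recipient, message))


def confirm_order(order_id: str, email: str, notifications: NotificationService) -> None:
    """Business logic: no cloud SDK import anywhere in this function."""
    notifications.notify(email, f"Order {order_id} confirmed")


notifications = InMemoryNotifications()
confirm_order("o-1", "a@example.com", notifications)
print(notifications.sent)
```

Swapping providers then means writing one new class that satisfies `NotificationService`, and the same interface makes the logic easy to test without network access.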

Serverless (Lambda, Cloud Functions) makes sense for event-triggered, short-lived, spiky workloads, the kind that would otherwise require a service running idle most of the time. It is a poor fit for long-running services and for latency-sensitive workloads where a cold start would be noticeable. Containers on managed Kubernetes remain the most flexible option for long-running services with consistent traffic.

The meta-rule: choose boring technology

The most durable architectural advice, and the hardest to follow when you're surrounded by interesting new options, is this: choose boring technology for your core, and interesting technology only where it gives you a genuine, specific advantage.

Every novel architectural choice is a debt. You pay it in operational complexity — nobody on your team has seen this failure mode before. You pay it in hiring — the pool of engineers who know this technology is smaller. You pay it in debugging — the community knowledge base is thinner.

Postgres, Redis, Kafka, Kubernetes, and standard cloud services are boring. They're boring because thousands of teams have run them at scale, written about their failure modes, and built tooling around them. That boredom is a feature.

When you introduce something novel — a new database paradigm, an untested framework, a custom protocol — ask specifically: what problem does this solve that a boring alternative doesn't? If the answer is "it's more elegant" or "it's what the cool teams are using," that's not a sufficient answer. If the answer is "it reduces our data pipeline latency from 30 seconds to 200 milliseconds and that directly affects our core user experience," that might be sufficient.

What goes wrong when you get this wrong

The distributed monolith. Services that are physically separate but logically coupled — calling each other synchronously in chains, sharing a database, or requiring coordinated deployments. You get all the complexity of a distributed system and none of the benefits.

Premature microservices. A team of five building twelve services. Nobody can keep the whole system in their head. Debugging requires tracing requests across eight services. Local development requires running the entire ecosystem. The overhead consumes more capacity than the benefits return.

The schema trap. A data model designed for the first version of requirements that becomes load-bearing for every subsequent version. Migrations become progressively more dangerous and every schema change requires careful coordination across teams.

The architecture-by-trend decision. Choosing event sourcing because it's interesting, not because it solves a specific problem. Choosing a graph database because the data is theoretically a graph, not because the query patterns require it. Introducing a service mesh on day one of a two-service system. Novel choices made for non-specific reasons accumulate into systems that are genuinely hard to operate.

If you do one thing from this post

Start an ADR file for your next architectural decision, even a small one. Write down the context, the options you considered, the option you chose, and why you chose it over the others.

The discipline of writing the rationale — not just the decision — is what forces the thinking. If you can't articulate why you chose this over the alternatives, you haven't made the decision yet; you've made a guess.

Next up: Post 3 — Your Developer Toolchain Is Either a Force Multiplier or a Tax

← Post 1: The Phase That Determines Whether Your Project Succeeds or Fails
