Bulkheads: Because One Sinking Service Shouldn't Sink Them All

Oscar van der Leij
10 min read

The first time I watched a single microservice take down an entire platform, I genuinely could not believe what I was seeing. One recommendation engine, badly misconfigured, started holding database connections open and never letting go. Within a few minutes the checkout service ran dry. Then authentication. Then inventory. The monitoring dashboard looked like a fireworks display, and every alert was pointing at a service that, on its own, did not even serve customers directly. We had built a system of supposedly independent services that were, in practice, one big shared pool waiting for something to go wrong.

The embarrassing part is how long it took me to understand why. Each service had its own codebase, its own deployment pipeline, its own team. But they all drank from the same well: shared connection pools, shared thread budgets, shared memory. Independence at the code level meant nothing when the underlying resources were communal. One hungry neighbour could starve everyone else, and there was nothing the other services could do about it.

When Sharing Isn't Caring

Think of a submarine. Naval architects don't design submarines as single, open spaces. They divide the hull into watertight compartments separated by walls called bulkheads. When a torpedo strikes and water rushes in, the crew can seal off the damaged section, containing the flood to just that compartment. The rest of the submarine remains dry and operational. This principle, born from centuries of maritime engineering, offers a powerful metaphor for software architecture.

In distributed systems, we face a similar challenge. Services share resources: database connection pools, thread pools, memory heaps, network sockets, API rate limits. This sharing seems efficient until one service misbehaves. A poorly optimized query might hold database connections indefinitely. A sudden traffic spike might exhaust all available threads. A memory leak might consume the entire heap. Without proper isolation, these failures cascade through the system like dominoes.

The bulkhead pattern addresses this by introducing intentional segmentation. Instead of allowing all services to draw from a common resource pool, we partition resources into isolated compartments. Each service or group of services gets its own dedicated allocation. When one service fails, it can only consume its own resources. The blast radius is contained. Other services continue operating normally, blissfully unaware that their neighbor is drowning.

[Figure: three services, each with its own resource pool. Service A: healthy. Service B: exhausted. Service C: healthy.]

The Anatomy of Isolation

Bulkheads operate at multiple levels of your architecture. Understanding where and how to apply them requires thinking about resource types and failure modes.

Resource-Level Bulkheads

At the most fundamental level, bulkheads partition shared resources:

  • Connection Pools: Instead of a single database connection pool shared by all services, each service gets its own pool with defined limits. Your recommendation service gets 20 connections, your checkout service gets 50, your reporting service gets 10. When recommendations goes haywire and exhausts its 20 connections, checkout still has its full allocation.
  • Thread Pools: Similarly, thread pools can be partitioned by function. HTTP request handling threads are separate from background job processing threads, which are separate from scheduled task threads. A runaway background job can't starve the system of threads needed to serve user requests.
  • Memory Allocation: In containerized environments, memory limits act as bulkheads. Each service container gets a defined memory ceiling. One service's memory leak can't consume memory needed by other services.
  • Circuit Capacity: Rate limiters, often paired with circuit breakers, create bulkheads around external dependencies. You might allow your system to make 1000 calls per minute to a payment gateway, but partition that capacity: 700 for checkout, 200 for subscription management, 100 for refunds.
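
As a rough illustration of the payment gateway split above, here is a minimal sketch using the System.Threading.RateLimiting types, which ship with ASP.NET Core projects and are also available as a NuGet package. The caller names and the TryCallGateway helper are hypothetical, and a fixed one-minute window is only one of several ways to enforce such a quota.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.RateLimiting;

// Hypothetical split of a 1000 calls/minute gateway quota across three callers.
var quotas = new Dictionary<string, int>
{
    ["checkout"] = 700,
    ["subscriptions"] = 200,
    ["refunds"] = 100,
};

// One independent limiter per caller: no caller can borrow another's share.
var limiters = quotas.ToDictionary(
    q => q.Key,
    q => new FixedWindowRateLimiter(new FixedWindowRateLimiterOptions
    {
        PermitLimit = q.Value,
        Window = TimeSpan.FromMinutes(1),
        QueueLimit = 0,            // reject instead of queuing when the share is used up
        AutoReplenishment = true,
    }));

// Hypothetical helper: returns true if the caller may contact the gateway now.
bool TryCallGateway(string caller)
{
    using var lease = limiters[caller].AttemptAcquire(1);
    return lease.IsAcquired;
}

// Refunds burns through its entire share for this window...
for (var i = 0; i < 100; i++) TryCallGateway("refunds");

Console.WriteLine(TryCallGateway("refunds"));   // False: refunds has used its 100 permits
Console.WriteLine(TryCallGateway("checkout"));  // True: checkout's 700 permits are untouched
```

The important design choice is QueueLimit = 0: a caller that has exhausted its share is turned away rather than parked, so the failure stays visible and local.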

Service-Level Bulkheads

Beyond individual resources, bulkheads can isolate entire service instances:

  • Deployment Isolation: Critical services run on dedicated infrastructure. Your payment processing service doesn't share compute resources with your image optimization service. If image processing consumes all CPU cycles generating thumbnails, payments continue unaffected.
  • Tenant Isolation: In multi-tenant systems, high-value customers might get dedicated service instances. When a low-priority tenant generates excessive load, it affects only its own bulkhead, not premium customers.
  • Geographic Isolation: Regional deployments act as natural bulkheads. A failure in your US-East region doesn't cascade to Europe or Asia-Pacific.
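
Tenant isolation can be approximated inside a single process as well. The following is a minimal sketch, assuming a hypothetical set of premium tenant names and made-up slot counts; dedicated service instances, as described above, give stronger isolation than this in-process version.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical tenant names and slot counts, purely for illustration.
var premiumTenants = new HashSet<string> { "acme" };
var bulkheads = new ConcurrentDictionary<string, SemaphoreSlim>();

// Each tenant lazily gets its own semaphore; premium tenants get a larger one.
SemaphoreSlim BulkheadFor(string tenant) =>
    bulkheads.GetOrAdd(tenant, t =>
    {
        var slots = premiumTenants.Contains(t) ? 10 : 2;
        return new SemaphoreSlim(slots, slots);
    });

// A noisy standard tenant takes both of its slots (Wait(0) never blocks).
var noisy = BulkheadFor("free-tier-42");
noisy.Wait(0);
noisy.Wait(0);

Console.WriteLine(noisy.Wait(0));               // False: the noisy tenant is at capacity
Console.WriteLine(BulkheadFor("acme").Wait(0)); // True: the premium tenant is unaffected
```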

The Trade-offs Table

Aspect | Without Bulkheads | With Bulkheads
Resource Efficiency | Higher (shared pools maximize utilization) | Lower (dedicated allocations may sit idle)
Failure Isolation | None (cascading failures common) | Strong (failures contained to bulkhead)
System Complexity | Lower (simpler resource management) | Higher (more configuration and monitoring)
Performance Ceiling | Higher (can burst into unused capacity) | Limited (hard caps prevent resource borrowing)
Operational Overhead | Lower (fewer pools to manage) | Higher (must tune each bulkhead separately)

The key insight: bulkheads trade resource efficiency for resilience. You're intentionally accepting some resource underutilization to prevent catastrophic failures.

Building Your Bulkheads

The following walkthrough builds a small but complete local demo. You will need the .NET 8 SDK and Docker Desktop installed. Every command runs in a terminal from the project root.

1. Scaffold the project

Create a minimal Web API and add the Npgsql connection library:

dotnet new webapi -n BulkheadDemo --no-https
cd BulkheadDemo
dotnet add package Npgsql --version 8.*

2. Start a local database

Create a docker-compose.yml at the project root:

services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: demo
      POSTGRES_DB: demo
      POSTGRES_HOST_AUTH_METHOD: trust
    ports:
      - "5432:5432"

POSTGRES_HOST_AUTH_METHOD: trust tells Postgres to accept any connection without a password. Because the port is also published to the host, anything that can reach port 5432 on your machine can log in. This is fine for a throwaway dev container. Never use it in production.

Start it with:

docker compose up -d

Verify it is running before continuing:

docker compose ps

You should see the postgres container listed. Depending on your Compose version, the STATUS column reads running or Up.

3. Add the connection string

Open appsettings.Development.json and add:

{
  "ConnectionStrings": {
    "Default": "Host=localhost;Port=5432;Database=demo;Username=demo;Maximum Pool Size=5"
  }
}

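Because the container uses trust authentication, the connection string needs no password. In a real deployment you would keep the credential out of committed config files and supply it at runtime instead, for example by overriding the whole value with the ConnectionStrings__Default environment variable. Maximum Pool Size=5 is intentionally small: it matches the combined capacity of the two bulkheads registered in step 5 (three slots for checkout, two for recommendations), so the semaphores are what partition the shared pool.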

4. Create a semaphore-based bulkhead

This is the core of the pattern. A SemaphoreSlim limits how many concurrent operations a caller can run. When all slots are taken, new callers wait up to a timeout and then get rejected.

Create BulkheadPolicy.cs:

namespace BulkheadDemo;


public sealed class BulkheadPolicy
{
    private readonly SemaphoreSlim _semaphore;
    private readonly string _name;
    private readonly TimeSpan _waitTimeout;


    public BulkheadPolicy(string name, int maxConcurrency, TimeSpan waitTimeout)
    {
        _name = name;
        _semaphore = new SemaphoreSlim(maxConcurrency, maxConcurrency);
        _waitTimeout = waitTimeout;
    }


    public int Available => _semaphore.CurrentCount;


    public async Task<T> ExecuteAsync<T>(Func<Task<T>> operation)
    {
        var acquired = await _semaphore.WaitAsync(_waitTimeout);
        if (!acquired)
            throw new BulkheadRejectedException(
                $"Bulkhead '{_name}' is at capacity. Try again later.");
        try
        {
            return await operation();
        }
        finally
        {
            _semaphore.Release();
        }
    }
}


public class BulkheadRejectedException(string message) : Exception(message);
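
To see the wait-then-reject behaviour in isolation, before any web plumbing is involved, a standalone console sketch can help. It uses a bare SemaphoreSlim rather than the BulkheadPolicy class, and the slot count, timeout, and CallAsync helper are illustrative assumptions.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Two slots, three simultaneous callers: the third should time out while waiting.
var slots = new SemaphoreSlim(2, 2);
var waitTimeout = TimeSpan.FromMilliseconds(200);

// Hypothetical caller: takes a slot, does one second of "work", then frees the slot.
async Task<string> CallAsync(int id)
{
    // WaitAsync returns false if no slot frees up within the timeout.
    if (!await slots.WaitAsync(waitTimeout))
        return $"caller {id}: rejected";
    try
    {
        await Task.Delay(TimeSpan.FromSeconds(1)); // stand-in for a slow operation
        return $"caller {id}: completed";
    }
    finally
    {
        slots.Release(); // always hand the slot back, even if the work throws
    }
}

var results = await Task.WhenAll(Enumerable.Range(1, 3).Select(id => CallAsync(id)));
foreach (var line in results)
    Console.WriteLine(line);
// caller 1: completed
// caller 2: completed
// caller 3: rejected
```

The third caller gives up after 200 ms because the first two hold their slots for a full second. BulkheadPolicy does the same thing, except that it throws instead of returning a string.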

5. Register two isolated bulkheads

Open Program.cs and replace its contents:

using BulkheadDemo;
using Npgsql;


var builder = WebApplication.CreateBuilder(args);
var connectionString = builder.Configuration.GetConnectionString("Default")!;


// Two separate concurrency limits, backed by the same Npgsql data source.
// Checkout gets 3 slots; recommendations gets 2.
// A runaway recommendations load cannot touch checkout's allocation.
builder.Services.AddKeyedSingleton("checkout",
    new BulkheadPolicy("checkout", maxConcurrency: 3, TimeSpan.FromSeconds(2)));


builder.Services.AddKeyedSingleton("recommendations",
    new BulkheadPolicy("recommendations", maxConcurrency: 2, TimeSpan.FromSeconds(1)));


builder.Services.AddSingleton(_ => new NpgsqlDataSourceBuilder(connectionString).Build());


var app = builder.Build();


// Simulate a fast checkout query
app.MapGet("/checkout", async (
    [FromKeyedServices("checkout")] BulkheadPolicy bulkhead,
    NpgsqlDataSource db) =>
{
    return await bulkhead.ExecuteAsync(async () =>
    {
        await using var conn = await db.OpenConnectionAsync();
        await using var cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT 1";
        await cmd.ExecuteScalarAsync();
        return Results.Ok(new { service = "checkout", status = "ok" });
    });
})
.WithName("Checkout");


// Simulate a slow recommendations query (5-second artificial delay)
app.MapGet("/recommendations", async (
    [FromKeyedServices("recommendations")] BulkheadPolicy bulkhead,
    NpgsqlDataSource db) =>
{
    return await bulkhead.ExecuteAsync(async () =>
    {
        await using var conn = await db.OpenConnectionAsync();
        await Task.Delay(TimeSpan.FromSeconds(5)); // simulate slow query
        return Results.Ok(new { service = "recommendations", status = "ok" });
    });
})
.WithName("Recommendations");


// Health endpoint shows available slots in each bulkhead
app.MapGet("/health", (
    [FromKeyedServices("checkout")] BulkheadPolicy checkout,
    [FromKeyedServices("recommendations")] BulkheadPolicy recommendations) =>
    Results.Ok(new
    {
        checkout = new { available = checkout.Available },
        recommendations = new { available = recommendations.Available }
    }));


app.Run();

6. Run the demo

Start the API:

dotnet run --urls http://localhost:5000

Open a second terminal and fire several concurrent requests at the slow recommendations endpoint:

# Windows PowerShell
$jobs = 1..6 | ForEach-Object {
    Start-Job { 
        try { Invoke-RestMethod http://localhost:5000/recommendations | ConvertTo-Json }
        catch { "REJECTED: $($_.Exception.Message)" }
    }
}
$jobs | Wait-Job | ForEach-Object { Receive-Job $_ }
$jobs | Remove-Job

# macOS / Linux
for i in {1..6}; do curl -s http://localhost:5000/recommendations & done; wait

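The four requests that cannot get a slot fail first: each waits out the one-second timeout and is then rejected with a BulkheadRejectedException, because the recommendations bulkhead only allows two concurrent operations. The demo does not catch that exception, so each rejection surfaces as an HTTP 500 error response. The two requests that did get a slot succeed after five seconds.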
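
The demo leaves BulkheadRejectedException unhandled, so ASP.NET Core turns it into a generic 500 response. One option, sketched here as a scratch Program.cs for a project created with dotnet new web, is a small inline middleware that maps the exception to 503 Service Unavailable with a Retry-After hint. The /full endpoint and the one-second retry value are assumptions for illustration; in the demo itself only the app.Use block would be needed, placed before the MapGet calls.

```csharp
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Translate bulkhead rejections into a response that clients can act on.
app.Use(async (context, next) =>
{
    try
    {
        await next(context);
    }
    catch (BulkheadRejectedException ex) when (!context.Response.HasStarted)
    {
        context.Response.StatusCode = StatusCodes.Status503ServiceUnavailable;
        context.Response.Headers.RetryAfter = "1"; // assumed retry hint, in seconds
        await context.Response.WriteAsJsonAsync(new { error = ex.Message });
    }
});

// Hypothetical endpoint that always behaves like a full bulkhead.
app.MapGet("/full", IResult () =>
    throw new BulkheadRejectedException("Bulkhead 'demo' is at capacity. Try again later."));

app.Run();

// Same exception type as in BulkheadPolicy.cs, repeated so this file stands alone.
public class BulkheadRejectedException(string message) : Exception(message);
```

A 503 with Retry-After tells well-behaved clients that the service is healthy but busy, which is a more honest signal than a 500.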

Now hit checkout while recommendations is saturated, that is, within the five seconds the slow requests are in flight:

# Windows PowerShell
Invoke-RestMethod http://localhost:5000/checkout

# macOS / Linux
curl http://localhost:5000/checkout

Checkout responds instantly. Its bulkhead is completely unaffected by whatever is happening in recommendations. That is the entire point.

Check the health endpoint to see slot counts in real time:

# Windows PowerShell
Invoke-RestMethod http://localhost:5000/health

# macOS / Linux
curl http://localhost:5000/health

7. Observe and tune

The health endpoint is a starting point for real observability in production. A practical next step is to expose these counters to your metrics system. With .NET's built-in System.Diagnostics.Metrics you can define the instruments without pulling in extra libraries (exporting them to a backend is a separate step). Add the following to BulkheadPolicy.cs:

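// At the top of BulkheadPolicy.cs (this namespace is not part of the implicit usings):
using System.Diagnostics.Metrics;

// New fields inside the BulkheadPolicy class:
private static readonly Meter Meter = new("BulkheadDemo");
private readonly ObservableGauge<int> _availableGauge;

// Inside the constructor, after creating _semaphore:
_availableGauge = Meter.CreateObservableGauge(
    $"bulkhead.{name}.available",
    () => _semaphore.CurrentCount,
    description: "Available slots in this bulkhead");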

Once the meter is exported, for example through the OpenTelemetry .NET SDK, Prometheus or any OpenTelemetry-compatible backend can collect this gauge, and Grafana or your alerting tool can fire when a bulkhead's available count drops toward zero, giving you early warning before callers start seeing rejections.
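
Before wiring up an exporter, it can be useful to confirm locally that the gauge reports what you expect. The sketch below reads an observable gauge in-process with MeterListener from System.Diagnostics.Metrics. The semaphore and the instrument name are stand-ins for the ones in BulkheadPolicy, not the demo's actual objects.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;
using System.Threading;

// Stand-in for the gauge in BulkheadPolicy: reports free semaphore slots on demand.
var semaphore = new SemaphoreSlim(3, 3);
var meter = new Meter("BulkheadDemo");
meter.CreateObservableGauge("bulkhead.checkout.available", () => semaphore.CurrentCount);

var readings = new List<int>();
using var listener = new MeterListener();

// Subscribe only to instruments published by the demo's meter.
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "BulkheadDemo")
        l.EnableMeasurementEvents(instrument);
};

// Called once per int measurement each time the gauges are polled.
listener.SetMeasurementEventCallback<int>((instrument, value, tags, state) =>
{
    readings.Add(value);
    Console.WriteLine($"{instrument.Name} = {value}");
});
listener.Start();

listener.RecordObservableInstruments(); // bulkhead.checkout.available = 3
semaphore.Wait();                       // occupy one slot
listener.RecordObservableInstruments(); // bulkhead.checkout.available = 2
```

Observable gauges are pull-based: nothing is recorded until something polls them, which is why the listener has to call RecordObservableInstruments explicitly here.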

The Captain's Checklist

Implementing bulkheads transforms your architecture from a fragile, interconnected web into a resilient system of isolated compartments:

  • Partition critical shared resources into dedicated pools per service or function, accepting some inefficiency to prevent cascading failures
  • Monitor bulkhead utilization continuously and alert when thresholds are approached, treating saturation as an early warning system
  • Design graceful degradation strategies for when bulkheads reach capacity, whether through queuing, rejection with retry signals, or fallback responses
  • Balance isolation granularity against operational complexity, starting with coarse-grained bulkheads around critical services before optimizing further
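
Of the degradation strategies in the checklist, the fallback response is perhaps the easiest to sketch. The example below assumes a hypothetical cached default list and a GetRecommendationsAsync helper; when the bulkhead is full it returns the default immediately instead of waiting or throwing.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var slots = new SemaphoreSlim(1, 1);
var fallback = new[] { "bestsellers" }; // hypothetical cached default response

// Hypothetical helper: runs the query if a slot is free, otherwise degrades.
async Task<string[]> GetRecommendationsAsync(Func<Task<string[]>> query)
{
    // Zero timeout: never queue behind a saturated bulkhead.
    if (!await slots.WaitAsync(TimeSpan.Zero))
        return fallback;
    try
    {
        return await query();
    }
    finally
    {
        slots.Release();
    }
}

// The first call takes the only slot and holds it for half a second.
var slow = GetRecommendationsAsync(async () =>
{
    await Task.Delay(500);
    return new[] { "personalised" };
});

// The second call arrives while the slot is taken and gets the fallback at once.
var degraded = await GetRecommendationsAsync(() => Task.FromResult(new[] { "personalised" }));

Console.WriteLine(string.Join(",", degraded));   // bestsellers
Console.WriteLine(string.Join(",", await slow)); // personalised
```

Whether a fallback is acceptable depends on the endpoint: a generic recommendations list is usually fine, while a stale checkout total is not.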

The next time you design a distributed system, ask yourself: if this service fails catastrophically, what's the blast radius? If the answer is "everything," you need bulkheads. Your 2:47 AM self will thank you.

What shared resources in your architecture represent single points of failure? Which services would benefit most from isolation? And perhaps most importantly, are you prepared to trade some resource efficiency for the peace of mind that comes from contained failures?
