Bulkheads: Because One Sinking Service Shouldn't Sink Them All

The first time I watched a single microservice take down an entire platform, I genuinely could not believe what I was seeing. One recommendation engine, badly misconfigured, started holding database connections open and never letting go. Within a few minutes the checkout service ran dry. Then authentication. Then inventory. The monitoring dashboard looked like a fireworks display, and every alert was pointing at a service that, on its own, did not even serve customers directly. We had built a system of supposedly independent services that were, in practice, one big shared pool waiting for something to go wrong.

The embarrassing part is how long it took me to understand why. Each service had its own codebase, its own deployment pipeline, its own team. But they all drank from the same well: shared connection pools, shared thread budgets, shared memory. Independence at the code level meant nothing when the underlying resources were communal. One hungry neighbour could starve everyone else, and there was nothing the other services could do about it.

Think of a submarine. Naval architects don't design submarines as single, open spaces. They divide them into watertight compartments separated by walls called bulkheads. When a torpedo strikes and water rushes in, the crew can seal off the damaged section, containing the flood to just that compartment. The rest of the submarine remains dry and operational. This principle, born from centuries of maritime engineering, offers a powerful metaphor for software architecture.

In distributed systems, we face a similar challenge. Services share resources: database connection pools, thread pools, memory heaps, network sockets, API rate limits. This sharing seems efficient until one service misbehaves. A poorly optimized query might hold database connections indefinitely. A sudden traffic spike might exhaust all available threads. A memory leak might consume the entire heap. Without proper isolation, these failures cascade through the system like dominoes.

The bulkhead pattern addresses this by introducing intentional segmentation. Instead of allowing all services to draw from a common resource pool, we partition resources into isolated compartments. Each service or group of services gets its own dedicated allocation. When one service fails, it can only consume its own resources. The blast radius is contained. Other services continue operating normally, blissfully unaware that their neighbor is drowning.

The Anatomy of Isolation

Bulkheads operate at multiple levels of your architecture. Understanding where and how to apply them requires thinking about resource types and failure modes.

Resource-Level Bulkheads

At the most fundamental level, bulkheads partition shared resources:

Connection Pools: Instead of a single database connection pool shared by all services, each service gets its own pool with defined limits. Your recommendation service gets 20 connections, your checkout service gets 50, your reporting service gets 10. When the recommendation service goes haywire and exhausts its 20 connections, checkout still has its full allocation.
Thread Pools: Similarly, thread pools can be partitioned by function. HTTP request handling threads are separate from background job processing threads, which are separate from scheduled task threads. A runaway background job can't starve the system of threads needed to serve user requests.
Memory Allocation: In containerized environments, memory limits act as bulkheads. Each service container gets a defined memory ceiling. One service's memory leak can't consume memory needed by other services.
Circuit Capacity: Rate limiters create bulkheads around external dependencies, and circuit breakers complement them by cutting off calls to a dependency that is already failing. You might allow your system to make 1000 calls per minute to a payment gateway, but partition that capacity: 700 for checkout, 200 for subscription management, 100 for refunds.
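
To make the capacity split concrete, here is a minimal sketch of a fixed-window call budget divided into per-consumer shares. The class name PartitionedCallBudget, the consumer names, and the choice to pass the clock in as a parameter are all assumptions made for illustration; a production system would more likely lean on an existing rate-limiting library than on a hand-rolled counter.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits a per-window call budget into fixed shares so
// one consumer cannot spend another consumer's allowance.
public sealed class PartitionedCallBudget
{
    private readonly object _gate = new();
    private readonly Dictionary<string, int> _limits;
    private readonly Dictionary<string, int> _used = new();
    private readonly TimeSpan _window;
    private DateTimeOffset _windowStart;

    public PartitionedCallBudget(
        IReadOnlyDictionary<string, int> limits, TimeSpan window, DateTimeOffset now)
    {
        _limits = new Dictionary<string, int>(limits);
        _window = window;
        _windowStart = now;
    }

    // Returns true if the consumer still has budget left in the current window.
    // The caller supplies the current time, which keeps the sketch deterministic.
    public bool TryAcquire(string consumer, DateTimeOffset now)
    {
        lock (_gate)
        {
            if (now - _windowStart >= _window)
            {
                _used.Clear();           // a new window resets every partition
                _windowStart = now;
            }

            if (!_limits.TryGetValue(consumer, out var limit))
                return false;            // unknown consumers get no budget

            _used.TryGetValue(consumer, out var used);   // 0 when absent
            if (used >= limit)
                return false;            // this partition is exhausted

            _used[consumer] = used + 1;
            return true;
        }
    }
}
```

With shares of 700, 200, and 100 and a one-minute window, a refund storm that burns through its 100 calls starts getting false from TryAcquire("refunds", ...), while TryAcquire("checkout", ...) keeps succeeding until checkout has used its own 700.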

Service-Level Bulkheads

Beyond individual resources, bulkheads can isolate entire service instances:

Deployment Isolation: Critical services run on dedicated infrastructure. Your payment processing service doesn't share compute resources with your image optimization service. If image processing consumes all CPU cycles generating thumbnails, payments continue unaffected.
Tenant Isolation: In multi-tenant systems, high-value customers might get dedicated service instances. When a low-priority tenant generates excessive load, it affects only its own bulkhead, not premium customers.
Geographic Isolation: Regional deployments act as natural bulkheads. A failure in your US-East region doesn't cascade to Europe or Asia-Pacific.
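
Tenant isolation does not always need dedicated instances. As a lighter-weight sketch of the same idea inside one process, the registry below gives every tenant its own semaphore. The TenantBulkheads name, the limitFor callback, and the fail-fast behaviour (no waiting at all) are assumptions for illustration, not something the pattern prescribes.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical registry: one semaphore per tenant, created on first use.
public sealed class TenantBulkheads
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _slots = new();
    private readonly Func<string, int> _limitFor;

    // limitFor decides how many concurrent operations each tenant may run.
    public TenantBulkheads(Func<string, int> limitFor) => _limitFor = limitFor;

    // Runs the operation inside the tenant's own compartment, or returns
    // false without running it when that tenant is already at its limit.
    public async Task<bool> TryRunAsync(string tenantId, Func<Task> operation)
    {
        var slots = _slots.GetOrAdd(tenantId, id => new SemaphoreSlim(_limitFor(id)));

        // A zero timeout means "take a free slot now or give up immediately".
        if (!await slots.WaitAsync(TimeSpan.Zero))
            return false;

        try
        {
            await operation();
            return true;
        }
        finally
        {
            slots.Release();   // hand the slot back even if the operation throws
        }
    }
}
```

A noisy tenant that fills its own slots only ever sees false for its own calls. Every other tenant draws from a different semaphore and is never queued behind it.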

The Trade-offs Table

Aspect	Without Bulkheads	With Bulkheads
Resource Efficiency	Higher (shared pools maximize utilization)	Lower (dedicated allocations may sit idle)
Failure Isolation	None (cascading failures common)	Strong (failures contained to bulkhead)
System Complexity	Lower (simpler resource management)	Higher (more configuration and monitoring)
Performance Ceiling	Higher (can burst into unused capacity)	Limited (hard caps prevent resource borrowing)
Operational Overhead	Lower (fewer pools to manage)	Higher (must tune each bulkhead separately)

The key insight: bulkheads trade resource efficiency for resilience. You're intentionally accepting some resource underutilization to prevent catastrophic failures.

Building Your Bulkheads

The following walkthrough builds a small but complete local demo. You will need .NET 8 SDK and Docker Desktop installed. Every command runs in a terminal from the project root.

1. Scaffold the project

Create a minimal Web API and add the Npgsql connection library:

dotnet new webapi -n BulkheadDemo --no-https
cd BulkheadDemo
dotnet add package Npgsql --version 8.*

2. Start a local database

Create a docker-compose.yml at the project root:

services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: demo
      POSTGRES_DB: demo
      POSTGRES_HOST_AUTH_METHOD: trust
    ports:
      - "5432:5432"

POSTGRES_HOST_AUTH_METHOD: trust tells Postgres to accept every connection that can reach it without asking for a password. This is fine for a throwaway dev container. Never use it in production.

Start it with:

docker compose up -d

Verify it is running before continuing:

docker compose ps

You should see the postgres container listed with a status of running or Up, depending on your Docker Compose version.

3. Add the connection string

Open appsettings.Development.json and add:

{
  "ConnectionStrings": {
    "Default": "Host=localhost;Port=5432;Database=demo;Username=demo;Maximum Pool Size=5"
  }
}

Because the container uses trust authentication, the connection string needs no password, so no credential ends up in a committed config file. The Maximum Pool Size=5 is intentionally small: it matches the combined capacity of the two bulkheads registered in step 5 (3 + 2), so a saturated bulkhead can never leave the other one without a connection.

4. Create a semaphore-based bulkhead

This is the core of the pattern. A SemaphoreSlim limits how many operations can run concurrently inside one compartment. When all slots are taken, new callers wait up to a timeout and then get rejected.

Create BulkheadPolicy.cs:

namespace BulkheadDemo;

// A named compartment that caps how many operations may run at once.
public sealed class BulkheadPolicy
{
    private readonly SemaphoreSlim _semaphore;
    private readonly string _name;
    private readonly TimeSpan _waitTimeout;

    public BulkheadPolicy(string name, int maxConcurrency, TimeSpan waitTimeout)
    {
        _name = name;
        // Initial count and maximum count are equal: every slot starts free.
        _semaphore = new SemaphoreSlim(maxConcurrency, maxConcurrency);
        _waitTimeout = waitTimeout;
    }

    // Number of free slots right now; read by the health endpoint.
    public int Available => _semaphore.CurrentCount;

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> operation)
    {
        // Wait for a free slot, but never longer than the configured timeout.
        var acquired = await _semaphore.WaitAsync(_waitTimeout);
        if (!acquired)
            throw new BulkheadRejectedException(
                $"Bulkhead '{_name}' is at capacity. Try again later.");
        try
        {
            return await operation();
        }
        finally
        {
            // Always hand the slot back, even if the operation throws.
            _semaphore.Release();
        }
    }
}

// Thrown when no slot frees up within the wait timeout.
public class BulkheadRejectedException(string message) : Exception(message);
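
If you want to convince yourself of the acquire, time out, release sequence before wiring anything to HTTP or Postgres, a database-free smoke test helps. The sketch below is an assumption-laden stand-in rather than part of the demo project: it uses a bare SemaphoreSlim in place of BulkheadPolicy, and the BulkheadSmokeTest name and the TaskCompletionSource used to park admitted callers are illustrative choices. It assumes callers is at least as large as slots.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BulkheadSmokeTest
{
    // Starts `callers` concurrent operations against a semaphore with `slots`
    // permits and reports how many ran and how many were rejected.
    public static async Task<(int Ran, int Rejected)> RunAsync(
        int slots, int callers, TimeSpan waitTimeout)
    {
        using var semaphore = new SemaphoreSlim(slots, slots);
        var release = new TaskCompletionSource();   // parks admitted callers

        async Task<bool> CallAsync()
        {
            if (!await semaphore.WaitAsync(waitTimeout))
                return false;                        // timed out waiting: rejected
            try
            {
                await release.Task;                  // simulate slow work
                return true;
            }
            finally
            {
                semaphore.Release();
            }
        }

        // Calls start in order, so the first `slots` callers take the permits
        // and every later caller has to wait.
        var calls = Enumerable.Range(0, callers).Select(_ => CallAsync()).ToArray();

        // Keep the admitted callers parked until all the waiters have timed out.
        await Task.WhenAll(calls.Skip(slots));
        release.SetResult();

        var results = await Task.WhenAll(calls);
        return (results.Count(ran => ran), results.Count(ran => !ran));
    }
}
```

Calling BulkheadSmokeTest.RunAsync(2, 6, TimeSpan.FromMilliseconds(50)) mirrors the recommendations bulkhead on a smaller clock: two callers hold the slots, the other four give up once their wait expires.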

5. Register two isolated bulkheads

Open Program.cs and replace its contents:

using BulkheadDemo;
using Npgsql;


var builder = WebApplication.CreateBuilder(args);
var connectionString = builder.Configuration.GetConnectionString("Default")!;


// Two separate concurrency limits, backed by the same Npgsql data source.
// Checkout gets 3 slots; recommendations gets 2.
// A runaway recommendations load cannot touch checkout's allocation.
builder.Services.AddKeyedSingleton("checkout",
    new BulkheadPolicy("checkout", maxConcurrency: 3, TimeSpan.FromSeconds(2)));


builder.Services.AddKeyedSingleton("recommendations",
    new BulkheadPolicy("recommendations", maxConcurrency: 2, TimeSpan.FromSeconds(1)));


builder.Services.AddSingleton(_ => new NpgsqlDataSourceBuilder(connectionString).Build());


var app = builder.Build();


// Simulate a fast checkout query
app.MapGet("/checkout", async (
    [FromKeyedServices("checkout")] BulkheadPolicy bulkhead,
    NpgsqlDataSource db) =>
{
    return await bulkhead.ExecuteAsync(async () =>
    {
        await using var conn = await db.OpenConnectionAsync();
        await using var cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT 1";
        await cmd.ExecuteScalarAsync();
        return Results.Ok(new { service = "checkout", status = "ok" });
    });
})
.WithName("Checkout");


// Simulate a slow recommendations query (5-second artificial delay)
app.MapGet("/recommendations", async (
    [FromKeyedServices("recommendations")] BulkheadPolicy bulkhead,
    NpgsqlDataSource db) =>
{
    return await bulkhead.ExecuteAsync(async () =>
    {
        await using var conn = await db.OpenConnectionAsync();
        await Task.Delay(TimeSpan.FromSeconds(5)); // simulate slow query
        return Results.Ok(new { service = "recommendations", status = "ok" });
    });
})
.WithName("Recommendations");


// Health endpoint shows available slots in each bulkhead
app.MapGet("/health", (
    [FromKeyedServices("checkout")] BulkheadPolicy checkout,
    [FromKeyedServices("recommendations")] BulkheadPolicy recommendations) =>
    Results.Ok(new
    {
        checkout = new { available = checkout.Available },
        recommendations = new { available = recommendations.Available }
    }));


app.Run();

6. Run the demo

Start the API:

dotnet run --urls http://localhost:5000

Open a second terminal and fire several concurrent requests at the slow recommendations endpoint:

# Windows PowerShell
$jobs = 1..6 | ForEach-Object {
    Start-Job { 
        try { Invoke-RestMethod http://localhost:5000/recommendations | ConvertTo-Json }
        catch { "REJECTED: $($_.Exception.Message)" }
    }
}
$jobs | Wait-Job | ForEach-Object { Receive-Job $_ }
$jobs | Remove-Job

# macOS / Linux
for i in {1..6}; do curl -s http://localhost:5000/recommendations & done; wait

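You will see the first two requests succeed after about five seconds. The remaining four wait up to one second for a free slot (the recommendations wait timeout) and are then rejected with a BulkheadRejectedException, because the recommendations bulkhead only allows two concurrent operations. The demo leaves that exception unhandled, so those callers receive an HTTP 500 response.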

Now hit checkout from another terminal while recommendations is still saturated, that is, within the five seconds the slow requests are in flight:

# Windows PowerShell
Invoke-RestMethod http://localhost:5000/checkout

# macOS / Linux
curl http://localhost:5000/checkout

Checkout responds instantly. Its bulkhead is completely unaffected by whatever is happening in recommendations. That is the entire point.

Check the health endpoint to see slot counts in real time:

# Windows PowerShell
Invoke-RestMethod http://localhost:5000/health

# macOS / Linux
curl http://localhost:5000/health

7. Observe and tune

The health endpoint is where you add real observability in production. A practical next step is to expose these counters to your metrics system. With .NET's built-in System.Diagnostics.Metrics you can do that without pulling in extra libraries. Add the following to BulkheadPolicy.cs:

// At the top of the file:
using System.Diagnostics.Metrics;

// New fields inside BulkheadPolicy:
private static readonly Meter Meter = new("BulkheadDemo");
private readonly ObservableGauge<int> _availableGauge;

// Inside the constructor, after creating _semaphore:
_availableGauge = Meter.CreateObservableGauge(
    $"bulkhead.{name}.available",
    () => _semaphore.CurrentCount,
    description: "Available slots in this bulkhead");

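The gauge itself needs no extra packages, but something still has to export it. Once an exporter such as the OpenTelemetry SDK is subscribed to the BulkheadDemo meter, Prometheus, Grafana, or any OpenTelemetry-compatible backend can collect this gauge and alert when a bulkhead's available count drops toward zero, giving you early warning before callers start seeing rejections.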
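
Before hooking up a real backend, it can be reassuring to read the gauge in-process. The sketch below uses MeterListener from System.Diagnostics.Metrics to poll an observable gauge once. The GaugeProbe class, the meter name passed in, and the instrument name bulkhead.available are assumptions made for the example, and it builds its own meter around a bare SemaphoreSlim instead of touching BulkheadPolicy.

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading;

public static class GaugeProbe
{
    // Publishes a gauge for the semaphore's free slots, then reads it once.
    public static int ReadAvailable(SemaphoreSlim semaphore, string meterName)
    {
        using var meter = new Meter(meterName);
        meter.CreateObservableGauge("bulkhead.available", () => semaphore.CurrentCount);

        var observed = -1;   // stays -1 if no measurement arrives
        using var listener = new MeterListener();

        // Subscribe only to instruments that belong to our meter.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == meterName)
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<int>(
            (instrument, value, tags, state) => observed = value);

        listener.Start();
        listener.RecordObservableInstruments();   // polls every enabled gauge now
        return observed;
    }
}
```

An exporter does essentially the same thing on a schedule: it enables the instrument, polls it, and forwards each measurement to the backend.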

The Captain's Checklist

Implementing bulkheads transforms your architecture from a fragile, interconnected web into a resilient system of isolated compartments:

Partition critical shared resources into dedicated pools per service or function, accepting some inefficiency to prevent cascading failures
Monitor bulkhead utilization continuously and alert when thresholds are approached, treating saturation as an early warning system
Design graceful degradation strategies for when bulkheads reach capacity, whether through queuing, rejection with retry signals, or fallback responses
Balance isolation granularity against operational complexity, starting with coarse-grained bulkheads around critical services before optimizing further
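
Of the degradation strategies in that checklist, the fallback response is the easiest to sketch. The helper below is a hypothetical variation on the bulkhead from the walkthrough: instead of throwing when no slot frees up in time, it returns a caller-supplied fallback value. The BulkheadFallback name and its signature are assumptions for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BulkheadFallback
{
    // Runs `operation` if a slot frees up within `waitTimeout`;
    // otherwise returns `fallback` instead of failing the caller.
    public static async Task<T> ExecuteOrFallbackAsync<T>(
        SemaphoreSlim slots, TimeSpan waitTimeout, Func<Task<T>> operation, T fallback)
    {
        if (!await slots.WaitAsync(waitTimeout))
            return fallback;       // compartment is full: degrade, don't fail

        try
        {
            return await operation();
        }
        finally
        {
            slots.Release();       // hand the slot back even on exceptions
        }
    }
}
```

For something like recommendations, the fallback might be an empty list or a cached set of bestsellers. The caller gets a slightly worse answer instead of an error, and the saturated compartment gets no additional load.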

The next time you design a distributed system, ask yourself: if this service fails catastrophically, what's the blast radius? If the answer is "everything," you need bulkheads. Your 2:47 AM self will thank you.

What shared resources in your architecture represent single points of failure? Which services would benefit most from isolation? And perhaps most importantly, are you prepared to trade some resource efficiency for the peace of mind that comes from contained failures?
