Deadlocks in Distributed Systems: Detection, Prevention, Recovery

In this article, we explain what distributed deadlocks are, how they differ from blocking and slowness, the main types of distributed deadlocks, and practical strategies to prevent them in real systems.

What Is a Distributed Deadlock?

In a single-process or monolithic application, deadlocks are ugly but predictable. In distributed systems, they are 'quieter', harder to detect, and far more expensive.

A distributed deadlock occurs when multiple independent services wait for each other, preventing any of them from moving forward. Each service is healthy on its own. The system as a whole is stuck.

This makes distributed deadlocks dangerous:

  • There is no single place where the deadlock "lives".
  • Logs look normal, services are up, CPUs are idle
  • Retries often make things worse
  • Recovery may require manual intervention

Most teams do not explicitly design for deadlocks. They appear later under load, during partial failure, or when a new service dependency is introduced.

Note: Not all waiting is a deadlock. Deadlock means no possible progress, not just slow progress.

Key characteristics of distributed deadlocks

  • Services are healthy, but workflows stall
  • Retries increase the load but do not unblock anything
  • Timeouts cause rollbacks that block other operations
  • Manual restarts “fix” the problem temporarily

These signals repeat across all the deadlock types we discuss next.

Example of distributed deadlock

Coffman Conditions under which a deadlock occurs 

Note: A deadlock can occur only if all four conditions hold.

1. Mutual exclusion

Only one participant can hold a resource at a time.

In distributed systems, this includes:

  • database row or table locks
  • distributed locks (Redis, etcd, ZooKeeper)
  • exclusive access to external APIs
  • single-writer guarantees

Nothing unusual here. Mutual exclusion is standard.

2. Hold and wait

A participant holds one resource while waiting for another.

This is where distributed systems start to hurt: the classic pattern is a service that holds a database lock while it waits for a synchronous call to another service to return.

3. No preemption

Resources cannot be forcibly taken away.

In distributed systems:

  • You cannot safely “steal” a DB lock
  • You cannot interrupt a remote call cleanly
  • You cannot roll back another service’s transaction

Once a service holds a resource, the rest of the system must wait for that resource to be released voluntarily.

4. Circular wait

A cycle of dependencies exists.

This is the condition everyone knows, yet rarely sees clearly in distributed systems: the cycle spans several services, so no single participant can observe the whole chain.
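
To make the cycle visible, here is a tiny Python sketch of a wait-for graph (the service names are illustrative). Each entry means "this service is waiting for a resource held by that service"; a circular wait is simply a cycle in this graph.

    # Each edge reads: key is waiting for a resource held by value.
    wait_for = {
        "order-service":     "payment-service",
        "payment-service":   "inventory-service",
        "inventory-service": "order-service",   # closes the cycle
    }

    def find_cycle(graph, start):
        # Follow the single outgoing "waiting for" edge until we revisit a node.
        seen, node = [], start
        while node in graph and node not in seen:
            seen.append(node)
            node = graph[node]
        return seen[seen.index(node):] if node in seen else None

    print(find_cycle(wait_for, "order-service"))
    # ['order-service', 'payment-service', 'inventory-service']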

Types of distributed deadlocks

Resource-Based Deadlocks

Resource-based deadlocks are the most common type of distributed deadlock. They happen when a service holds an exclusive resource while waiting for a resource held by another service.

What counts as a “resource” in distributed systems

In practice, a resource can be:

  • Database rows or tables.
  • Distributed locks (Redis, etcd, ZooKeeper).
  • Files or object storage handles.
  • Rate-limited external APIs.
  • Single-writer business invariants (for example, “only one active order per user”).

If access is exclusive, it can participate in a deadlock.

How to prevent resource-based deadlocks

  1. Never hold locks while making synchronous remote calls. Commit or release resources before calling another service.
  2. Define a global resource ordering. If multiple resources must be locked, always lock them in the same order across services (points 1 and 2 are illustrated in the sketch after this list).
  3. Reduce lock scope and duration. Short transactions. Minimal critical sections.
  4. Prefer async workflows for cross-service coordination. Events and message queues break circular waiting.
  5. Design idempotent operations. So retries do not extend lock lifetimes.
  6. Treat business invariants as resources. If an invariant is exclusive, explicitly design it.
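
A minimal sketch of points 1 and 2 above. In-process threading.Lock objects stand in for database or distributed locks, and call_remote_service is a placeholder for a real HTTP/gRPC call; the point is the shape: lock in one canonical order, finish local work, release, and only then call out.

    import threading
    from contextlib import ExitStack

    # Stand-ins for DB rows / distributed locks; names are illustrative.
    LOCKS = {name: threading.Lock() for name in ("order:42", "inventory:42")}

    def acquire_in_order(names):
        # Point 2: every service locks in the same (sorted) order,
        # so no circular wait can form.
        stack = ExitStack()
        for name in sorted(names):
            stack.enter_context(LOCKS[name])
        return stack

    def call_remote_service(payload):
        print("calling remote service with", payload)  # placeholder RPC

    def process_order(order_id):
        with acquire_in_order([f"order:{order_id}", f"inventory:{order_id}"]):
            result = {"order": order_id, "status": "reserved"}  # local work only
        # Point 1: locks are already released here, before the remote call.
        call_remote_service(result)

    process_order(42)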

Communication Deadlocks

A communication deadlock occurs when services block each other through synchronous communication, even without explicit locks.

Each service waits for a response from another service, forming a cycle of blocking calls. No database locks. No distributed locks. Just waiting on the network.

This makes communication deadlocks especially tricky:

  • Nothing looks “locked.”
  • The thread pools are busy but not doing work
  • Timeouts and retries often amplify the problem

What counts as a "blocked resource" in Communication Deadlocks

In communication deadlocks, the “resource” is implicit:

  • threads
  • request slots
  • connection pools
  • in-flight request capacity

Once a thread is blocked waiting for a response, it cannot process incoming requests that might unblock others.
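
A toy illustration of this, using a single in-process thread pool as a stand-in for a service's request threads: both workers block on "downstream" work that can only run on the same, already exhausted pool, so nothing ever completes.

    import threading, time
    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(max_workers=2)   # the service's request slots
    both_busy = threading.Barrier(2)           # make the stall deterministic

    def handle_request(name):
        both_busy.wait()                                    # both workers now occupied
        downstream = pool.submit(handle_downstream, name)   # queued behind us
        return downstream.result()                          # blocks the worker forever

    def handle_downstream(name):
        return f"done: {name}"

    futures = [pool.submit(handle_request, i) for i in range(2)]
    time.sleep(1)
    print("stuck requests:", sum(not f.done() for f in futures))  # 2
    # Cancel the queued downstream work so the script can exit (Python 3.9+).
    pool.shutdown(wait=False, cancel_futures=True)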

How to prevent communication deadlocks

  1. Avoid synchronous call cycles. No service should synchronously depend on itself through other services, even indirectly.
  2. Prefer async communication for workflows. Queues and events break waiting chains.
  3. Separate inbound and outbound thread pools. So blocked outbound calls do not block inbound requests.
  4. Set strict timeouts and fail fast. Waiting forever is worse than failing early.
  5. Apply bulkheads and concurrency limits. Limit the number of requests a service can block on downstream dependencies (points 4 and 5 are illustrated in the sketch after this list).
  6. Keep services behaviorally autonomous. A service should not require another service to complete its own request.
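
A minimal sketch of points 4 and 5 from the list above, assuming a plain requests-based HTTP client; the URL, the limits, and the DownstreamUnavailable exception are illustrative.

    import threading
    import requests

    OUTBOUND_LIMIT = threading.BoundedSemaphore(10)  # bulkhead: max in-flight downstream calls
    CALL_TIMEOUT = 2.0                               # seconds; fail fast instead of waiting forever

    class DownstreamUnavailable(Exception):
        pass

    def call_downstream(path):
        # Refuse immediately if the bulkhead is full instead of queueing forever.
        if not OUTBOUND_LIMIT.acquire(blocking=False):
            raise DownstreamUnavailable("too many in-flight downstream calls")
        try:
            resp = requests.get(f"https://inventory.internal{path}", timeout=CALL_TIMEOUT)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            raise DownstreamUnavailable(str(exc)) from exc
        finally:
            OUTBOUND_LIMIT.release()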

Transactional Deadlocks

Transactional deadlocks occur when distributed transactions or long-running workflows block each other, usually due to coordination protocols or incomplete rollback.

Unlike resource-based deadlocks, these do not always involve direct locks.
Instead, they are caused by transaction participants waiting for a global decision that never arrives.

This deadlock type is common in:

  • two-phase commit (2PC)
  • saga-based workflows with synchronous steps
  • long-running business processes

What acts as a "blocked resource" in Transactional Deadlocks

In transactional deadlocks, the blocked resources are:

  • transaction participants
  • prepared but not committed data
  • reserved business state (money, inventory, quotas)
  • compensating actions waiting for execution

Once a participant enters a “prepared” or “pending” state, it may block others indefinitely.

Example deadlock in 2PC, step by step:

  1. Transaction Manager asks all participants to prepare
  2. Services A and B lock resources and respond “ready.”
  3. Transaction Manager becomes unavailable
  4. Participants cannot commit or roll back
  5. Locked state persists indefinitely
  6. No progress is possible without manual recovery.
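
A minimal sketch of the stuck participant described in the steps above (class and method names are illustrative, not a real 2PC library): after voting "ready", the participant cannot safely commit or abort on its own, so whatever it reserved stays blocked.

    class Participant:
        def __init__(self):
            self.state = "idle"
            self.reserved = None

        def prepare(self, amount):
            self.reserved = amount   # money, inventory, etc. put on hold
            self.state = "prepared"
            return "ready"           # vote sent to the transaction manager

        def on_global_decision(self, decision):
            # Only the coordinator can move us out of "prepared".
            self.state = "committed" if decision == "commit" else "aborted"
            self.reserved = None

    p = Participant()
    p.prepare(100)
    # If the transaction manager never returns, p.state stays "prepared"
    # and whatever `reserved` guards blocks every other workflow that needs it.
    print(p.state)   # prepared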

How to prevent transactional deadlocks

  1. Avoid distributed transactions when possible. Prefer eventual consistency over global atomicity.
  2. Prefer Saga over 2PC. Sagas reduce deadlock risk, but only if steps and compensations are asynchronous and independent.
  3. Make compensation idempotent and independent. A rollback should not depend on another service being available or responsive.
  4. Set clear timeouts for prepared states. Prepared should not mean “forever” (points 4 and 5 are illustrated in the sketch after this list).
  5. Persist workflow state explicitly. So recovery can continue after failures.
  6. Design for partial failure first. Assume coordinators and participants will disappear.
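
A sketch of points 4 and 5 above: the prepared state carries an explicit deadline and is persisted, so a recovery job can resolve it on its own (here via presumed abort plus an idempotent compensation) instead of waiting for the coordinator. The JSON-file persistence and the compensate hook are illustrative stand-ins.

    import json, time

    PREPARE_TTL = 30.0   # point 4: seconds a participant may stay prepared

    def persist(state, path="txn-state.json"):
        # Point 5: the workflow state survives restarts, so recovery can resume.
        with open(path, "w") as f:
            json.dump(state, f)

    def recover(state, now=None, compensate=lambda s: None):
        now = now or time.time()
        if state["status"] == "prepared" and now - state["prepared_at"] > PREPARE_TTL:
            compensate(state)            # must be idempotent (point 3)
            state["status"] = "aborted"
        return state

    txn = {"id": "txn-17", "status": "prepared", "prepared_at": time.time() - 60}
    persist(txn)
    print(recover(txn)["status"])        # "aborted" - resolved without the coordinator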

Lock Manager Deadlocks

Lock manager deadlocks occur when a centralized or semi-centralized lock service becomes the point where waiting chains form and never resolve.

Instead of services blocking each other directly, they all block through the lock manager.

This type of deadlock is subtle because:

  • Services do not talk to each other
  • The deadlock lives “inside” the coordination infrastructure
  • Failures look like infrastructure instability, not logic errors

Typical lock managers include Redis, etcd, ZooKeeper, and database-level advisory locks.

What acts as a "blocked resource" in Lock Manager Deadlocks

In this case, the resources are:

  • distributed locks themselves
  • lock leases
  • sessions with the lock manager
  • leadership or fencing tokens

If a lock is held but cannot be released safely, everyone else waits.

Example, step by step:

  1. Service A acquires a distributed lock
  2. Service A crashes or loses the network
  3. The lock is never released
  4. Service B waits indefinitely

If TTLs or fencing are misconfigured, this lock can live forever.
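
A minimal sketch of how a TTL and a fencing token close this gap, using redis-py; the key layout, TTL, and holder name are illustrative.

    import redis

    r = redis.Redis()

    def acquire(resource, holder, ttl_ms=10_000):
        # SET key value NX PX ttl: the lock disappears after ttl_ms even if the
        # holder crashes, so Service B never waits forever.
        if not r.set(f"lock:{resource}", holder, nx=True, px=ttl_ms):
            return None   # someone else holds the lock
        # Fencing token: strictly increasing, attached to every protected write
        # so storage can reject writes from a stale (expired) holder.
        return r.incr(f"lock:{resource}:fence")

    token = acquire("invoice:42", holder="service-a")
    if token is not None:
        print("acquired with fencing token", token)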

How to prevent lock manager deadlocks

  1. Prefer data-level guarantees over distributed locks
    Databases are better lock managers than ad-hoc systems.
  2. Always use lock TTLs. No lock should live longer than the business operation.
  3. Use fencing tokens. So stale lock holders cannot corrupt the state.
  4. Avoid acquiring multiple locks. If unavoidable, enforce strict lock ordering.
  5. Monitor lock duration and contention. Long-held locks are early warning signals.
  6. Treat the lock manager as critical infrastructure. Design for its partial failure explicitly.

Distributed Deadlock Comparison Table

Deadlock Type | Primary "Resource" Being Held | Common Trigger | Prevention Strategy
--- | --- | --- | ---
Resource-Based | Database rows, Redis keys, or API locks | Services holding a lock while waiting for another service | Global resource ordering and avoiding locks during remote calls
Communication | Threads, connection pools, and request slots | Circular synchronous call chains (A → B → A) | Use async communication, strict timeouts, and bulkhead patterns
Transactional | Prepared data or reserved business state | A Transaction Manager failing after participants "prepare" | Prefer Sagas over 2PC and ensure all rollback actions are idempotent
Lock Manager | Distributed leases or fencing tokens | A service crashing or losing the network without releasing a lock | Mandatory Lock TTLs (Time-To-Live) and fencing tokens

Key Takeaways

  • Most distributed systems do not implement accurate deadlock detection and instead rely on timeouts, retries, and recovery.
  • Never Hold a Lock During a Network Call: This is the most common cause of resource-based deadlocks. Commit your local transaction before contacting another service.
  • Use "Timeouts" Everywhere: A communication deadlock happens because something is willing to wait forever. Set strict timeouts for every outbound request.
  • Design for "Preemption" with TTLs: Since you cannot "steal" a resource back in a distributed system, you must ensure resources expire automatically via Time-To-Live (TTL) settings.
  • Monitor workflow progress, not just service health. Deadlocks often happen in healthy systems.
