Deadlocks in Distributed Systems: Detection, Prevention, Recovery

In this article, we explain what distributed deadlocks are, how they differ from blocking and slowness, the main types of distributed deadlocks, and practical strategies to prevent them in real systems.

What Is a Distributed Deadlock?

In a single-process or monolithic application, deadlocks are ugly but predictable. In distributed systems, they are 'quieter', harder to detect, and far more expensive.

A distributed deadlock occurs when multiple independent services wait for each other, preventing any of them from moving forward. Each service is healthy on its own. The system as a whole is stuck.

This makes distributed deadlocks dangerous:

  • There is no single place where the deadlock "lives".
  • Logs look normal, services are up, CPUs are idle
  • Retries often make things worse
  • Recovery may require manual intervention

Most teams do not explicitly design for deadlocks. They appear later under load, during partial failure, or when a new service dependency is introduced.

Note: Not all waiting is a deadlock. Deadlock means no possible progress, not just slow progress.

Key characteristics of distributed deadlocks

  • Services are healthy, but workflows stall
  • Retries increase the load but do not unblock anything
  • Timeouts cause rollbacks that block other operations
  • Manual restarts “fix” the problem temporarily

These signals repeat across all the deadlock types we discuss next.

Example of distributed deadlock

Coffman Conditions under which a deadlock occurs 

Note: A deadlock can occur only if all four conditions hold.

1. Mutual exclusion

Only one participant can hold a resource at a time.

In distributed systems, this includes:

  • database row or table locks
  • distributed locks (Redis, etcd, ZooKeeper)
  • exclusive access to external APIs
  • single-writer guarantees

Nothing unusual here. Mutual exclusion is standard.

2. Hold and wait

A participant holds one resource while waiting for another.

This is where distributed systems start to hurt: the classic pattern is a service that holds a database lock while it waits for a synchronous call to another service to return.

3. No preemption

Resources cannot be forcibly taken away.

In distributed systems:

  • You cannot safely “steal” a DB lock
  • You cannot interrupt a remote call cleanly
  • You cannot roll back another service’s transaction

Once a service holds a resource, the rest of the system must wait for that resource to be released voluntarily.

4. Circular wait

A cycle of dependencies exists.

This is the condition everyone knows, yet rarely sees clearly in distributed systems: the cycle spans several services, so no single participant can observe the whole chain.
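
To make the cycle visible, here is a tiny Python sketch of a wait-for graph (the service names are illustrative). Each entry means "this service is waiting for a resource held by that service"; a circular wait is simply a cycle in this graph.

    # Each edge reads: key is waiting for a resource held by value.
    wait_for = {
        "order-service":     "payment-service",
        "payment-service":   "inventory-service",
        "inventory-service": "order-service",   # closes the cycle
    }

    def find_cycle(graph, start):
        # Follow the single outgoing "waiting for" edge until we revisit a node.
        seen, node = [], start
        while node in graph and node not in seen:
            seen.append(node)
            node = graph[node]
        return seen[seen.index(node):] if node in seen else None

    print(find_cycle(wait_for, "order-service"))
    # ['order-service', 'payment-service', 'inventory-service']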

Types of distributed deadlocks

Resource-Based Deadlocks

Resource-based deadlocks are the most common type of distributed deadlock. They happen when a service holds an exclusive resource while waiting for a resource held by another service.

What counts as a “resource” in distributed systems

In practice, a resource can be:

  • Database rows or tables.
  • Distributed locks (Redis, etcd, ZooKeeper).
  • Files or object storage handles.
  • Rate-limited external APIs.
  • Single-writer business invariants (for example, “only one active order per user”).

If access is exclusive, it can participate in a deadlock.

How to prevent resource-based deadlocks

  1. Never hold locks while making synchronous remote calls. Commit or release resources before calling another service.
  2. Define a global resource ordering. If multiple resources must be locked, always lock them in the same order across services (points 1 and 2 are illustrated in the sketch after this list).
  3. Reduce lock scope and duration. Short transactions. Minimal critical sections.
  4. Prefer async workflows for cross-service coordination. Events and message queues break circular waiting.
  5. Design idempotent operations. So retries do not extend lock lifetimes.
  6. Treat business invariants as resources. If an invariant is exclusive, explicitly design it.
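
A minimal sketch of points 1 and 2 above. In-process threading.Lock objects stand in for database or distributed locks, and call_remote_service is a placeholder for a real HTTP/gRPC call; the point is the shape: lock in one canonical order, finish local work, release, and only then call out.

    import threading
    from contextlib import ExitStack

    # Stand-ins for DB rows / distributed locks; names are illustrative.
    LOCKS = {name: threading.Lock() for name in ("order:42", "inventory:42")}

    def acquire_in_order(names):
        # Point 2: every service locks in the same (sorted) order,
        # so no circular wait can form.
        stack = ExitStack()
        for name in sorted(names):
            stack.enter_context(LOCKS[name])
        return stack

    def call_remote_service(payload):
        print("calling remote service with", payload)  # placeholder RPC

    def process_order(order_id):
        with acquire_in_order([f"order:{order_id}", f"inventory:{order_id}"]):
            result = {"order": order_id, "status": "reserved"}  # local work only
        # Point 1: locks are already released here, before the remote call.
        call_remote_service(result)

    process_order(42)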

Communication Deadlocks

A communication deadlock occurs when services block each other through synchronous communication, even without explicit locks.

Each service waits for a response from another service, forming a cycle of blocking calls. No database locks. No distributed locks. Just waiting on the network.

This makes communication deadlocks especially tricky:

  • Nothing looks “locked.”
  • The thread pools are busy but not doing work
  • Timeouts and retries often amplify the problem

What counts as a "blocked resource" in Communication Deadlocks

In communication deadlocks, the “resource” is implicit:

  • threads
  • request slots
  • connection pools
  • in-flight request capacity

Once a thread is blocked waiting for a response, it cannot process incoming requests that might unblock others.
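
A toy illustration of this, using a single in-process thread pool as a stand-in for a service's request threads: both workers block on "downstream" work that can only run on the same, already exhausted pool, so nothing ever completes.

    import threading, time
    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(max_workers=2)   # the service's request slots
    both_busy = threading.Barrier(2)           # make the stall deterministic

    def handle_request(name):
        both_busy.wait()                                    # both workers now occupied
        downstream = pool.submit(handle_downstream, name)   # queued behind us
        return downstream.result()                          # blocks the worker forever

    def handle_downstream(name):
        return f"done: {name}"

    futures = [pool.submit(handle_request, i) for i in range(2)]
    time.sleep(1)
    print("stuck requests:", sum(not f.done() for f in futures))  # 2
    # Cancel the queued downstream work so the script can exit (Python 3.9+).
    pool.shutdown(wait=False, cancel_futures=True)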

How to prevent communication deadlocks

  1. Avoid synchronous call cycles. No service should synchronously depend on itself through other services, even indirectly.
  2. Prefer async communication for workflows. Queues and events break waiting chains.
  3. Separate inbound and outbound thread pools. So blocked outbound calls do not block inbound requests.
  4. Set strict timeouts and fail fast. Waiting forever is worse than failing early.
  5. Apply bulkheads and concurrency limits. Limit the number of requests a service can block on downstream dependencies (points 4 and 5 are illustrated in the sketch after this list).
  6. Keep services behaviorally autonomous. A service should not require another service to complete its own request.
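
A minimal sketch of points 4 and 5 from the list above, assuming a plain requests-based HTTP client; the URL, the limits, and the DownstreamUnavailable exception are illustrative.

    import threading
    import requests

    OUTBOUND_LIMIT = threading.BoundedSemaphore(10)  # bulkhead: max in-flight downstream calls
    CALL_TIMEOUT = 2.0                               # seconds; fail fast instead of waiting forever

    class DownstreamUnavailable(Exception):
        pass

    def call_downstream(path):
        # Refuse immediately if the bulkhead is full instead of queueing forever.
        if not OUTBOUND_LIMIT.acquire(blocking=False):
            raise DownstreamUnavailable("too many in-flight downstream calls")
        try:
            resp = requests.get(f"https://inventory.internal{path}", timeout=CALL_TIMEOUT)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            raise DownstreamUnavailable(str(exc)) from exc
        finally:
            OUTBOUND_LIMIT.release()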

Transactional Deadlocks

Transactional deadlocks occur when distributed transactions or long-running workflows block each other, usually due to coordination protocols or incomplete rollback.

Unlike resource-based deadlocks, these do not always involve direct locks.
Instead, they are caused by transaction participants waiting for a global decision that never arrives.

This deadlock type is common in:

  • two-phase commit (2PC)
  • saga-based workflows with synchronous steps
  • long-running business processes

What acts as a "blocked resource" in Transactional Deadlocks

In transactional deadlocks, the blocked resources are:

  • transaction participants
  • prepared but not committed data
  • reserved business state (money, inventory, quotas)
  • compensating actions waiting for execution

Once a participant enters a “prepared” or “pending” state, it may block others indefinitely.

Example deadlock in 2PC, step by step:

  1. Transaction Manager asks all participants to prepare
  2. Services A and B lock resources and respond “ready.”
  3. Transaction Manager becomes unavailable
  4. Participants cannot commit or roll back
  5. Locked state persists indefinitely
  6. No progress is possible without manual recovery.
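
A minimal sketch of the stuck participant described in the steps above (class and method names are illustrative, not a real 2PC library): after voting "ready", the participant cannot safely commit or abort on its own, so whatever it reserved stays blocked.

    class Participant:
        def __init__(self):
            self.state = "idle"
            self.reserved = None

        def prepare(self, amount):
            self.reserved = amount   # money, inventory, etc. put on hold
            self.state = "prepared"
            return "ready"           # vote sent to the transaction manager

        def on_global_decision(self, decision):
            # Only the coordinator can move us out of "prepared".
            self.state = "committed" if decision == "commit" else "aborted"
            self.reserved = None

    p = Participant()
    p.prepare(100)
    # If the transaction manager never returns, p.state stays "prepared"
    # and whatever `reserved` guards blocks every other workflow that needs it.
    print(p.state)   # prepared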

How to prevent transactional deadlocks

  1. Avoid distributed transactions when possible. Prefer eventual consistency over global atomicity.
  2. Prefer Saga over 2PC. Sagas reduce deadlock risk, but only if steps and compensations are asynchronous and independent.
  3. Make compensation idempotent and independent. A rollback should not depend on another service being available or responsive.
  4. Set clear timeouts for prepared states. Prepared should not mean “forever” (points 4 and 5 are illustrated in the sketch after this list).
  5. Persist workflow state explicitly. So recovery can continue after failures.
  6. Design for partial failure first. Assume coordinators and participants will disappear.
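
A sketch of points 4 and 5 above: the prepared state carries an explicit deadline and is persisted, so a recovery job can resolve it on its own (here via presumed abort plus an idempotent compensation) instead of waiting for the coordinator. The JSON-file persistence and the compensate hook are illustrative stand-ins.

    import json, time

    PREPARE_TTL = 30.0   # point 4: seconds a participant may stay prepared

    def persist(state, path="txn-state.json"):
        # Point 5: the workflow state survives restarts, so recovery can resume.
        with open(path, "w") as f:
            json.dump(state, f)

    def recover(state, now=None, compensate=lambda s: None):
        now = now or time.time()
        if state["status"] == "prepared" and now - state["prepared_at"] > PREPARE_TTL:
            compensate(state)            # must be idempotent (point 3)
            state["status"] = "aborted"
        return state

    txn = {"id": "txn-17", "status": "prepared", "prepared_at": time.time() - 60}
    persist(txn)
    print(recover(txn)["status"])        # "aborted" - resolved without the coordinator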

Lock Manager Deadlocks

Lock manager deadlocks occur when a centralized or semi-centralized lock service becomes the point where waiting chains form and never resolve.

Instead of services blocking each other directly, they all block through the lock manager.

This type of deadlock is subtle because:

  • Services do not talk to each other
  • The deadlock lives “inside” the coordination infrastructure
  • Failures look like infrastructure instability, not logic errors

Typical lock managers include Redis, etcd, ZooKeeper, and database-level advisory locks.

What acts as a "blocked resource" in Lock Manager Deadlocks

In this case, the resources are:

  • distributed locks themselves
  • lock leases
  • sessions with the lock manager
  • leadership or fencing tokens

If a lock is held but cannot be released safely, everyone else waits.

Example, step by step:

  1. Service A acquires a distributed lock
  2. Service A crashes or loses the network
  3. The lock is never released
  4. Service B waits indefinitely

If TTLs or fencing are misconfigured, this lock can live forever.
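
A minimal sketch of how a TTL and a fencing token close this gap, using redis-py; the key layout, TTL, and holder name are illustrative.

    import redis

    r = redis.Redis()

    def acquire(resource, holder, ttl_ms=10_000):
        # SET key value NX PX ttl: the lock disappears after ttl_ms even if the
        # holder crashes, so Service B never waits forever.
        if not r.set(f"lock:{resource}", holder, nx=True, px=ttl_ms):
            return None   # someone else holds the lock
        # Fencing token: strictly increasing, attached to every protected write
        # so storage can reject writes from a stale (expired) holder.
        return r.incr(f"lock:{resource}:fence")

    token = acquire("invoice:42", holder="service-a")
    if token is not None:
        print("acquired with fencing token", token)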

How to prevent lock manager deadlocks

  1. Prefer data-level guarantees over distributed locks
    Databases are better lock managers than ad-hoc systems.
  2. Always use lock TTLs. No lock should live longer than the business operation.
  3. Use fencing tokens. So stale lock holders cannot corrupt the state.
  4. Avoid acquiring multiple locks. If unavoidable, enforce strict lock ordering.
  5. Monitor lock duration and contention. Long-held locks are early warning signals.
  6. Treat the lock manager as critical infrastructure. Design for its partial failure explicitly.

Distributed Deadlock Comparison Table

Deadlock Type | Primary "Resource" Being Held | Common Trigger | Prevention Strategy
--- | --- | --- | ---
Resource-Based | Database rows, Redis keys, or API locks | Services holding a lock while waiting for another service | Global resource ordering and avoiding locks during remote calls
Communication | Threads, connection pools, and request slots | Circular synchronous call chains (A → B → A) | Use async communication, strict timeouts, and bulkhead patterns
Transactional | Prepared data or reserved business state | A Transaction Manager failing after participants "prepare" | Prefer Sagas over 2PC and ensure all rollback actions are idempotent
Lock Manager | Distributed leases or fencing tokens | A service crashing or losing the network without releasing a lock | Mandatory Lock TTLs (Time-To-Live) and fencing tokens

Key Takeaways

  • Most distributed systems do not implement accurate deadlock detection and instead rely on timeouts, retries, and recovery.
  • Never Hold a Lock During a Network Call: This is the most common cause of resource-based deadlocks. Commit your local transaction before contacting another service.
  • Use "Timeouts" Everywhere: A communication deadlock happens because something is willing to wait forever. Set strict timeouts for every outbound request.
  • Design for "Preemption" with TTLs: Since you cannot "steal" a resource back in a distributed system, you must ensure resources expire automatically via Time-To-Live (TTL) settings.
  • Monitor workflow progress, not just service health. Deadlocks often happen in healthy systems.
