Overview
A first-person operational checklist for reading PostgreSQL replication lag, checking WAL pressure, and deciding when the problem is real enough to escalate.
Quick takeaways
- Check whether replica-backed reads or failover expectations are already at risk.
- Compare the current lag with the normal pattern for that workload window.
- Treat acceleration and instability as more important than a single absolute number.
Section 01
I separate scary dashboards from real risk
Replication lag graphs can look dramatic before users feel anything. So my first question is always whether the lag is affecting the thing the replicas are actually there for: read traffic, failover confidence, or recovery expectations.
I do not want to underreact, but I also do not want to turn every lag spike into a production drama. The best first step is understanding whether the lag is transient, repeating, or growing faster than the system can recover.
- Check whether replica-backed reads or failover expectations are already at risk.
- Compare the current lag with the normal pattern for that workload window.
- Treat acceleration and instability as more important than a single absolute number.
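To make the transient, repeating, or growing distinction concrete, here is a minimal Python sketch of how I might classify a window of lag samples. The function name `classify_lag`, the `baseline` input, and the `tolerance` multiplier are all hypothetical choices for illustration, not PostgreSQL settings; the baseline is assumed to come from whatever you already record as normal for that workload window.

```python
def classify_lag(samples, baseline, tolerance=2.0):
    """Classify replication lag samples (seconds, oldest first).

    baseline:  typical lag for this workload window (assumed input, for
               example the median for the same hour on a normal day).
    tolerance: multiple of the baseline that counts as abnormal (assumed).
    """
    if len(samples) < 3:
        raise ValueError("need at least three samples to see a trend")

    threshold = baseline * tolerance

    # Count separate excursions above the threshold.
    excursions = 0
    above = False
    for value in samples:
        if value > threshold and not above:
            excursions += 1
        above = value > threshold

    if excursions == 0:
        return "normal"

    # Consecutive differences show the direction of travel.
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    latest = samples[-1]

    if latest > threshold and all(d > 0 for d in deltas[-2:]):
        return "growing"      # still above threshold and still climbing
    if excursions >= 2:
        return "repeating"    # crossed the threshold more than once
    if latest <= threshold:
        return "transient"    # one spike that has already recovered
    return "elevated"         # above threshold but not accelerating
```

For example, `classify_lag([1, 3, 6, 12], baseline=1)` reports a growing problem, while `classify_lag([1, 8, 1, 1], baseline=1)` reports a single spike that already recovered. The labels matter more than the exact thresholds, which need tuning per workload.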
Section 02
Then I look at WAL, I/O, and replay behavior
Once I know the lag matters, I move to the mechanics. Is the primary generating WAL faster than expected? Is the replica slow to receive WAL, slow to write and flush it to disk, or slow to replay it? Those are very different problems, and mixing them together wastes time.
I usually check disk I/O on the replica, network stability between primary and replica, replay delays, and whether the replica is simply being asked to keep up with a workload that has outgrown the topology.
- Break the problem into send, write, flush, and replay stages.
- Look for storage or network pressure before blaming PostgreSQL itself.
- Check whether workload spikes or maintenance jobs created the lag window.
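The stage breakdown above can be sketched against the `pg_stat_replication` view on the primary, which exposes `sent_lsn`, `write_lsn`, `flush_lsn`, and `replay_lsn` per standby (column names as of PostgreSQL 10 and later). The query text below is a sketch to run with whatever client you use; the helper `dominant_stage`, the `HINTS` wording, and the 16 MiB noise floor are my own assumptions for illustration.

```python
# Query sketch for the primary. Each gap is a byte distance between two
# adjacent positions in the replication pipeline.
STAGE_QUERY = """
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_gap,
       pg_wal_lsn_diff(sent_lsn, write_lsn)            AS write_gap,
       pg_wal_lsn_diff(write_lsn, flush_lsn)           AS flush_gap,
       pg_wal_lsn_diff(flush_lsn, replay_lsn)          AS replay_gap
FROM pg_stat_replication;
"""

STAGES = ("send_gap", "write_gap", "flush_gap", "replay_gap")

# Rough first places to look for each stage (assumed hints, not diagnoses).
HINTS = {
    "send_gap": "primary has WAL it has not sent yet: check sender and network",
    "write_gap": "WAL sent but not written on the replica: check network and replica I/O",
    "flush_gap": "WAL written but not flushed: check replica storage latency",
    "replay_gap": "WAL flushed but not replayed: check replay speed and conflicts",
}


def dominant_stage(row, min_bytes=16 * 1024 * 1024):
    """Return the stage holding the largest backlog, or None if all are small.

    row:       mapping with the four *_gap columns from STAGE_QUERY.
    min_bytes: noise floor below which no stage is flagged (assumed value).
    """
    # NULL columns arrive as None; treat them as zero backlog.
    gaps = {stage: max(0, int(row.get(stage) or 0)) for stage in STAGES}
    worst = max(STAGES, key=lambda stage: gaps[stage])
    return worst if gaps[worst] >= min_bytes else None
```

A row where `replay_gap` is hundreds of megabytes and the other gaps are near zero points at replay on the standby, not at the network or the primary. That is the split I want before touching any settings.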
Section 03
The real fix is usually upstream of the replica
Sometimes the replica needs tuning, but often the lasting fix is somewhere earlier in the chain: a bursty write pattern, oversized maintenance work, poor batching, or a primary under more pressure than it should be carrying.
That is why I try to close replication lag incidents with a system-level lesson, not just a replica-level tweak. Otherwise the same graph comes back next week wearing a different disguise.
- Use the lag incident to ask what changed on the primary or in the workload.
- Reduce burstiness where possible instead of endlessly tuning around it.
- Aim for better failover confidence, not just a temporarily smaller lag graph.
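One way to put a number on burstiness is to sample the primary's WAL position over time and compare the peak generation rate with the mean. This is a minimal sketch that assumes you already collect `(timestamp, LSN)` pairs, for example from periodic calls to `pg_current_wal_lsn()`. The helper names and the peak-to-mean ratio are my own illustration, not a PostgreSQL metric.

```python
def lsn_to_bytes(lsn):
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def wal_rates(samples):
    """Return WAL bytes per second between consecutive samples.

    samples: list of (unix_seconds, lsn_text) tuples, oldest first.
    """
    rates = []
    for (t0, lsn0), (t1, lsn1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((lsn_to_bytes(lsn1) - lsn_to_bytes(lsn0)) / elapsed)
    return rates


def burstiness(rates):
    """Peak-to-mean ratio of WAL rates; 1.0 means perfectly even writes."""
    if not rates:
        raise ValueError("need at least one rate")
    mean = sum(rates) / len(rates)
    return max(rates) / mean if mean > 0 else 0.0
```

A ratio that jumps during the lag window suggests the write pattern, not the replica, is what changed. Tracking it after a batching or maintenance change shows whether the fix actually smoothed the load.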