Fix ERROR_CLUSTER_POISONED (0x00001713) Fast

You’re staring at a cluster node that won’t come back

Yeah, that ERROR_CLUSTER_POISONED (0x00001713) message is a pain. It means Windows Failover Cluster has marked that node as toxic. Every time it tries to join, the cluster says “nope, you’re done.” I had a client last month whose entire file server cluster went down because one node got poisoned after a storage hiccup. Here’s how to get it back online and stop it from happening again.

The fix: Force-clear the poison flag

Skip all the rebooting and event log staring. The real fix is to remove the node’s poison state using PowerShell. Run this on any healthy node in the cluster (preferably the one that owns the quorum):

Clear-ClusterNode -Name <PoisonedNodeName>

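Replace <PoisonedNodeName> with the actual node name. If you’re not sure, run Get-ClusterNode | Where State -eq 'Down' to find it. Clear-ClusterNode resets the cluster configuration stored on the target node, so double-check the name before you run it. After that, try starting the Cluster service on the poisoned node: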

Start-Service ClusSvc

If the node joins the cluster and stays stable for a few minutes, you’re golden.
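“Stays stable for a few minutes” is easier to judge with a small polling loop than by refreshing Failover Cluster Manager. Here’s a minimal sketch; the function name Test-ClusterNodeStable and its parameters are made up for illustration, and it assumes the FailoverClusters module is loaded on the machine where you run it. The state lookup is passed in as a script block so you can dry-run the loop without a cluster.

```powershell
# Hypothetical helper (not part of the FailoverClusters module):
# polls a node's state and reports whether it stayed 'Up' for every check.
function Test-ClusterNodeStable {
    param(
        [Parameter(Mandatory)][string]$NodeName,
        [int]$Checks = 10,            # number of polls
        [int]$IntervalSeconds = 30,   # pause between polls
        # Default lookup uses Get-ClusterNode; swap in your own for a dry run.
        [scriptblock]$GetState = { param($n) (Get-ClusterNode -Name $n).State.ToString() }
    )

    for ($i = 1; $i -le $Checks; $i++) {
        $state = & $GetState $NodeName
        if ($state -ne 'Up') {
            Write-Warning "Check $i of ${Checks}: $NodeName is '$state', not 'Up'."
            return $false
        }
        # No need to sleep after the final check.
        if ($i -lt $Checks) { Start-Sleep -Seconds $IntervalSeconds }
    }
    return $true
}
```

With the defaults, Test-ClusterNodeStable -NodeName <PoisonedNodeName> watches the node for about five minutes and returns $false the moment it drops out of the Up state.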

When Clear-ClusterNode fails

Sometimes the cmdlet throws an access denied error or says the node is in a bad state. That happened to me on a Server 2016 cluster after a backup software crash. In that case, you need to evict the node entirely and add it back:

On a healthy node, run: Remove-ClusterNode <PoisonedNodeName>
On the poisoned node, stop the Cluster service: Stop-Service ClusSvc
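Clear the leftover cluster configuration on the poisoned node. The supported way is to run Clear-ClusterNode -Force locally; deleting files by hand is a last resort, and the pattern below may not match every file on your build (the database itself is normally C:\Windows\Cluster\CLUSDB):
Remove-Item -Path "C:\Windows\Cluster\cluster.*" -Force
On the healthy node, clear any stuck quorum votes: Set-ClusterQuorum -NoWitness (you’ll re-add the witness later)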
On the healthy node, add the node back: Add-ClusterNode <PoisonedNodeName>
Reconfigure the witness if needed: Set-ClusterQuorum -FileShareWitness \\path\to\witness

This nukes the node’s local cluster config and forces a fresh join. Do it during a maintenance window because it’ll disrupt cluster operations briefly.

Why the node got poisoned in the first place

Cluster poison detection is a safety mechanism. If a node fails to respond to heartbeats or crashes the Cluster service repeatedly (usually 3 times within a short period), Windows marks it as poisoned. The idea is to stop a flaky node from dragging the whole cluster down. Common triggers:

Storage timeouts – A disk latency spike on the shared storage caused the Cluster service to hang. I saw this on a hyperconverged cluster with cheap SSDs.
Network hiccups – A switch port flapping or a bad NIC driver makes the node appear dead to other nodes.
Memory pressure – The Cluster service runs out of memory and crashes. Usually happens on nodes with less than 8 GB of RAM in a large cluster.
Third-party software conflicts – Antivirus or backup agents that lock cluster registry keys. One time, Symantec Endpoint Protection was the culprit.

Less common variations of this issue

Node stuck in “Cleaning” state

Sometimes after eviction, the node never fully leaves and shows as “Cleaning” in Failover Cluster Manager. That’s a deadlock. Force it out with:

Get-ClusterNode -Name <NodeName> | Clear-ClusterNode -Force

If that doesn’t work, restart the Cluster service on all nodes (one at a time) to flush the state.

Quorum witness keeps voting the node out

If the witness (file share or cloud) is misconfigured or unreachable from one node, the cluster may think that node is constantly down. Verify connectivity to the witness from the poisoned node. For a file share witness, test with Test-Path \\witness\share. If it fails, reconfigure the quorum to use a different witness or dynamic quorum.
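To separate “the server is unreachable” from “the share is unreachable,” it can help to check the SMB port and the path together. A rough sketch follows; Get-UncServerName and Test-WitnessShare are hypothetical helper names, and the check runs as whoever is logged on, while the cluster itself reaches the witness with the cluster name account, so a pass here doesn’t prove the permissions are right.

```powershell
# Hypothetical helper: pulls the server name out of a \\server\share path.
function Get-UncServerName {
    param([Parameter(Mandatory)][string]$UncPath)
    if ($UncPath -match '^\\\\([^\\]+)\\[^\\]+') { return $Matches[1] }
    throw "Not a UNC share path: $UncPath"
}

# Hypothetical helper: run it on the poisoned node (Windows only).
function Test-WitnessShare {
    param([Parameter(Mandatory)][string]$SharePath)

    $server = Get-UncServerName -UncPath $SharePath
    [pscustomobject]@{
        Server         = $server
        # TCP 445 is the SMB port the file share witness is reached over.
        SmbPortOpen    = (Test-NetConnection -ComputerName $server -Port 445 -WarningAction SilentlyContinue).TcpTestSucceeded
        # Same check as the Test-Path call mentioned above.
        ShareReachable = Test-Path -LiteralPath $SharePath
    }
}
```

If SmbPortOpen is False, look at the network or firewall first. If the port is open but ShareReachable is False, the share name or its permissions are the more likely suspects.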

Cluster log shows “Node poison count exceeded” repeatedly

Even after fixing, the node gets poisoned again within minutes. That means the underlying cause (like storage latency) isn’t resolved. Check the cluster log with:

Get-ClusterLog -Destination C:\ClusterLogs -TimeSpan 30

Look for lines containing “poison” or “timeout”. If you see references to a specific Cluster Shared Volume (CSV), run Get-ClusterSharedVolume -Name <CSVName> | Get-ClusterParameter and check the BlockSize or LunReserveSize values. I’ve fixed nodes by increasing the CSV redirect timeout:

(Get-Cluster).BlockedRedirectedAccessTimeout = 30

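Default is 5 seconds; bumping it to 30 gives the node more room to recover before being declared dead. Cluster properties vary between Windows Server versions, so confirm the property exists on your build first with Get-Cluster | Format-List *.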
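Cluster logs are huge, so scanning them by eye for “poison” or “timeout” gets old fast. Here’s a small sketch that does the filtering with Select-String; the function name Find-ClusterLogHits is made up, and it assumes the logs were written as .log files to the folder you passed to Get-ClusterLog.

```powershell
# Hypothetical helper: lists every line in the generated cluster logs
# that matches one of the search patterns (case-insensitive by default).
function Find-ClusterLogHits {
    param(
        [Parameter(Mandatory)][string]$LogFolder,
        [string[]]$Patterns = @('poison', 'timeout')
    )

    Get-ChildItem -Path $LogFolder -Filter '*.log' -File |
        Select-String -Pattern $Patterns |
        Select-Object Filename, LineNumber, Line
}
```

Running Find-ClusterLogHits -LogFolder C:\ClusterLogs | Select-Object -First 20 shows the earliest matches in each file, which is usually where the root cause sits.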

Prevention – keep nodes from getting poisoned again

Monitor storage latency – Set alerts on disk response times over 20 ms for CSV disks. Use PerfMon counters Cluster Shared Volume\Avg. Disk sec/Read and Write.
Update NIC drivers and firmware – Outdated Broadcom or Intel drivers are notorious for causing heartbeats to drop. Keep them current.
Add a dedicated heartbeat network – Use a separate VLAN or physical NIC for cluster communication, and give it a lower metric than the other networks so the cluster prefers it. There’s no Set-ClusterNetwork cmdlet; you set properties on the network object instead, for example (Get-ClusterNetwork -Name <NetworkName>).Metric = 900.
Check event IDs 1069 and 1135 – These log cluster resource failures and nodes being removed from active cluster membership. Correlate them with storage and network events to catch issues early.
Increase Cluster service crash count threshold – If you have flaky hardware you can’t replace immediately, raise the poison limit with a registry key:
HKLM\Cluster\Parameters\PoisonThreshold (DWORD, default 3, max 10). But this is a band-aid – fix the root cause.
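For the event ID check in the list above, bucketing events by hour makes it easier to line them up against storage or network incidents. This is only a sketch: Group-ClusterEventsByHour is a hypothetical helper, and it assumes each input object has the Id and TimeCreated properties that Get-WinEvent records carry.

```powershell
# Hypothetical helper: counts events per hour and event ID.
function Group-ClusterEventsByHour {
    param([Parameter(Mandatory)][object[]]$Events)

    $Events |
        Group-Object -Property { '{0:yyyy-MM-dd HH}:00 / {1}' -f $_.TimeCreated, $_.Id } |
        Sort-Object Name |
        ForEach-Object { [pscustomobject]@{ Bucket = $_.Name; Count = $_.Count } }
}

# Usage on a cluster node (Windows only), covering the last 24 hours:
# $events = Get-WinEvent -FilterHashtable @{
#     LogName = 'System'; Id = 1069, 1135; StartTime = (Get-Date).AddDays(-1)
# } -ErrorAction SilentlyContinue
# if ($events) { Group-ClusterEventsByHour -Events $events }
```

A burst of 1135 events in a single hour that matches a latency alert or a switch log entry is a strong hint about which component to chase.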

That’s it. No silver bullet, but these steps have saved my bacon on multiple clusters. If you’re still stuck, check the cluster log for “ERROR_CLUSTER_POISONED” and look for the first timeout event. That’ll point you right to the problem component.