This post talks about an error that occurs when running replication in AlwaysOn Availability Groups cluster. We will also see the possible causes of this error, how to troubleshoot it, and how to go about preventing it.
Here is a scenario. You have two SQL Server 2017 instances in an Availability Groups cluster. A few databases of hundreds of Gigabytes size are currently replicated. However, recently, you started seeing the following errors frequently in the errorlog:
AlwaysOn Availability Groups connection with primary database terminated for secondary database 'a' on the availability replica with Replica ID: 2
AlwaysOn Availability Groups connection with primary database terminated for secondary database 'b' on the availability replica with Replica ID: 2
AlwaysOn Availability Groups connection with primary database terminated for secondary database 'c' on the availability replica with Replica ID: 2
Then followed by the AlwaysOn Availability Groups connection resuming operations normally after a few minutes.
We will outline the investigation strategy supported by any tools or queries used.
This error is caused by the affected databases going offline and then coming back online after few minutes and sometimes hours. When I checked SQL Server service and found out that it is still on, I realized that this issue can be tied to the AlwaysOn configuration and inner working setup comparatively with how replication technology works.
I did further research on how AlwaysOn works. It relies on Windows Failover Cluster technology to know when things are going well. All participating servers use quorum - a voting mechanism - to know which among them will be in active mode and which will be in passive mode.
However, all the servers need to see the network in order to see other servers/nodes in the network. This means that when the network connection of an instance terminates for even a brief moment, the databases in the availability groups rolls back their transactions and go offline.
This explains the additional error message like the one below that usually occur immediately before the above posted messages.
Nonqualified transactions are being rolled back in database A for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
This behavior of AlwaysOn is in contrast with that of replication which leaves SQL Server databases running even when there is a drop in network.
What steps can we take to ensure that these errors do not occur in the future?
One way to prevent this issue from happening again is to connect SQL Servers in the replicated environment to two separate network switches such that when one network fails, the server instances can continue talking to each other with the other network.
More from @sqldibia