SQL Server AAG – The Importance of WSFC Quorum Voting

SQL Server Always On uses Windows Server Failover Clustering (WSFC) technology. WSFC uses a quorum-based approach (a minimum number of members must be present for proceedings to be valid) to monitor overall cluster health and maximize node-level fault tolerance.

Each node in a WSFC cluster participates in periodic heartbeat communication to share the node’s health status with the other nodes. Unresponsive nodes are considered to be in a failed state. A quorum node set is a majority of the voting nodes and witnesses in the WSFC cluster. The overall health and status of a WSFC cluster is determined by a periodic quorum vote. 

If a WSFC cluster is set offline because of quorum failure, manual intervention is required to bring it back online.

So what voting modes can we use?

Quorum mode is configured at the WSFC cluster level. The Failover Cluster Manager utility will recommend a quorum mode based on the number of nodes in the cluster.

  • Node Majority. More than one-half of the voting nodes in the cluster must vote affirmatively for the cluster to be healthy.
  • Node and File Share Majority. Similar to Node Majority quorum mode, except that a remote file share is also configured as a voting witness, and connectivity from any node to that share is also counted as an affirmative vote. More than one-half of the possible votes must be affirmative for the cluster to be healthy. As a best practice, the witness file share should not reside on any node in the cluster, and it should be visible to all nodes in the cluster.
  • Node and Disk Majority. Similar to Node Majority quorum mode, except that a shared disk cluster resource is also designated as a voting witness, and connectivity from any node to that shared disk is also counted as an affirmative vote. More than one-half of the possible votes must be affirmative for the cluster to be healthy.
  • Disk Only. A shared disk cluster resource is designated as a witness, and connectivity by any node to that shared disk is counted as an affirmative vote.
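
If you want to check which mode your cluster is actually using, you can do it straight from SQL Server (assuming SQL Server 2012 or later, where the HADR DMVs are available); a minimal query:

SELECT cluster_name,
       quorum_type_desc,  -- e.g. NODE_MAJORITY or NODE_AND_FILE_SHARE_MAJORITY
       quorum_state_desc  -- NORMAL_QUORUM, FORCED_QUORUM or UNKNOWN_QUORUM_STATE
FROM sys.dm_hadr_cluster;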

So technically anything could have a vote? A disk, a file share, nodes…

Yes. By default, each node in the WSFC cluster is included as a member of the cluster quorum; each node has a single vote in determining the overall cluster health, and each node will continuously attempt to establish a quorum.
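
You can see every quorum member and its vote count from SQL Server too; a quick sketch using the sys.dm_hadr_cluster_members DMV (again, SQL Server 2012 or later):

SELECT member_name,
       member_type_desc,       -- CLUSTER_NODE, DISK_WITNESS or FILE_SHARE_WITNESS
       member_state_desc,      -- UP or DOWN
       number_of_quorum_votes  -- 0 means that member has no vote
FROM sys.dm_hadr_cluster_members;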

So in a 2 node cluster, could we just have one node that votes and take the vote away from the other?

No. An individual node in a WSFC cluster cannot definitively determine whether the cluster as a whole is healthy or unhealthy. At any given moment, from the perspective of each node, some of the other nodes may appear to be offline, in the process of failover, or unresponsive due to a network communication failure. A key function of the quorum vote is to determine whether the apparent state of each node in the WSFC cluster is indeed the actual state of those nodes.

So what’s the risk of only having 2 voting nodes?

When more than one voting node is able to establish a quorum on its own, that is known as a split-brain scenario. In such a scenario, the nodes in the separate quorums may behave differently, and in conflict with one another. This scenario is more likely when performing a forced failover, which explicitly subdivides the quorum node set.

If split-brain occurs, you're more than likely going to see problems at the AAG level. You might see that more than one node thinks it's the primary. This can cause databases to become out of sync or inaccessible, and transaction log build-up may occur. Eventually, if not right away, you're going to see outages.
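
If you suspect this, a quick check is to run a query like the one below on each node and compare what each node believes; the joins assume the standard AAG catalog views:

SELECT ag.name AS availability_group,
       ar.replica_server_name,
       ars.role_desc,  -- PRIMARY, SECONDARY or RESOLVING, as this node sees it
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar ON ar.replica_id = ars.replica_id
JOIN sys.availability_groups AS ag ON ag.group_id = ars.group_id;

If two nodes both report PRIMARY for the same availability group, you're looking at exactly the split-brain conflict described above.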

So what’s the recommended way to achieve quorum in a 2 node cluster?

The easiest way is a file share witness, which is easily configurable. Other options are:

  • Add a 3rd node (which would be a relative waste of resources)
  • A shared disk (requires setting up a clustered disk)
  • A Cloud witness (Possible Security risks, requires cloud access, may incur additional costs)

Note: you’re always looking to achieve an odd number of votes. If you have more than 2 nodes in your cluster, you can start looking at setting the node voting weight, and you may be able to achieve a reliable quorum this way (the sys.dm_hadr_cluster_members query above will show each member’s vote).

Cluster Resource in Clustered Service or Application Failed – AAG Network Interface Failure

When trying to fail over your Always On Availability Group, you get the error message “Cluster Resource ‘<ResourceName>’ In Clustered Service or Application ‘<AAGName>’ Failed”.

Okay? So what happened?

My client received this error when trying to fail over to their DR site. There were 4 availability groups on the same server. It’s a multi-subnet cluster, so there is a second IP address specified for each Always On Availability Group listener.

Always reserve your IPs

The problem came when the second IP attempted to come online and subsequently failed. With the server having 4 AAGs, there are 4 listeners configured:

  1. On the primary site, there are 4 IP addresses (yep, one for each listener)
  2. On the secondary site, there should be another 4 IP addresses configured

For the secondary site, closer inspection showed that the same IP had been designated for each listener.
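
You can spot this sort of duplicate from T-SQL by listing every listener with its configured IP addresses; a minimal sketch:

SELECT ag.name AS availability_group,
       agl.dns_name,
       agl.port,
       ip.ip_address,
       ip.state_desc  -- e.g. ONLINE or OFFLINE
FROM sys.availability_group_listeners AS agl
JOIN sys.availability_group_listener_ip_addresses AS ip ON ip.listener_id = agl.listener_id
JOIN sys.availability_groups AS ag ON ag.group_id = agl.group_id
ORDER BY ip.ip_address;

The same IP address appearing against more than one listener is the smoking gun.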

Give that listener its own IP

This means that when we started the failover to DR, the first AAG came up okay: it started the network interface for its IP address and worked correctly. However, that IP address was now in use, and as the other AAGs failed over, their network interfaces gave us the above error.

Great! Now what?

Thankfully we don’t have to recreate the listener, and we can resolve the issue with no outage. All that needs to be done is to alter the IP address in Failover Cluster Manager:

  • Open Failover Cluster Manager
  • Expand Services and Applications
  • For each AAG, right-click the offline (duplicate) IP and click Properties
  • Update the Static IP Address with a new valid IP

Always make sure the IP addresses you’ve assigned are reserved so they don’t get used elsewhere. If an IP address is already in use when WSFC tries to bring that resource online, you will get the error message “Cluster Resource ‘<ResourceName>’ In Clustered Service or Application ‘<AAGName>’ Failed”.

SQL Server: Always On replication cannot connect to the availability group

SQL Server uses an endpoint to talk to the other replicas within an Always On Availability Group (AAG). The most common problem I have seen is a firewall blocking the port SQL Server uses to talk between replicas; the AAG port needs to be open, and stay open, for AAGs to work. By default, standard SQL Server connections use port 1433, but AAGs do not talk to each other on this port. You’re looking for the endpoint listener port, which by default is 5022.

You can check which port your endpoint is using:

SELECT name, port, is_dynamic_port, state, state_desc FROM sys.tcp_endpoints;

This also tells you whether the endpoint is started (state = 0) or stopped (state = 1); the state_desc column spells it out for you.

There is a simple way to see if SQL Server is listening on this port. From the secondary you are having an issue with, do the following:

  1. Start CMD (RUN > CMD)
  2. Type: telnet <Primary AAG IP Address> <Port Number>

For example:

TELNET 192.168.0.1 5022

If telnet works, you will see a black screen; if it doesn’t, you will get a connection failed message. Simple?

If it works, your secondary server should be able to connect and you’re likely seeing a different issue. If it doesn’t, either the port is blocked or the endpoint isn’t running. You will find the endpoint in SSMS under:

Server Objects > Endpoints > Database Mirroring

By default, AAG uses the “Hadr_endpoint” name.
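
You can confirm this from T-SQL as well, rather than clicking through Object Explorer:

SELECT name,
       type_desc,   -- DATABASE_MIRRORING
       state_desc,  -- STARTED or STOPPED
       role_desc,   -- ALL for an AAG endpoint
       is_encryption_enabled
FROM sys.database_mirroring_endpoints;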

If the endpoint isn’t there, you should create it:

CREATE ENDPOINT [Hadr_endpoint]
STATE = STARTED
AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
FOR DATABASE_MIRRORING (ROLE = ALL, AUTHENTICATION = WINDOWS NEGOTIATE,
ENCRYPTION = REQUIRED ALGORITHM AES)
GO

You have checked the firewalls

What next? Try restarting the endpoint:

USE master
GO
ALTER ENDPOINT [Hadr_endpoint] STATE = STOPPED;
GO
ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED;
GO

If that fails, you could try recreating the endpoint:

USE [master]
GO

DROP ENDPOINT [Hadr_endpoint]
GO

CREATE ENDPOINT [Hadr_endpoint]
STATE = STARTED
AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
FOR DATABASE_MIRRORING (ROLE = ALL, AUTHENTICATION = WINDOWS NEGOTIATE,
ENCRYPTION = REQUIRED ALGORITHM AES)
GO
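
One caveat if you do drop and recreate the endpoint: any permissions granted on the old endpoint go with it, so if your replicas connect using a domain service account you may need to re-grant CONNECT afterwards. A sketch, where [DOMAIN\SQLServiceAccount] is a placeholder for your own account:

GRANT CONNECT ON ENDPOINT::[Hadr_endpoint] TO [DOMAIN\SQLServiceAccount];
GO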

I cannot stress enough that you should check the firewall ports before recreating or restarting the endpoint, especially if you’re running a multi-node AAG. If one node is working fine, it’s likely not an endpoint issue; it’s more likely network/firewall.