image

Microsoft SQL Server Always On with HPE SimpliVity


image


Executive summary

The HPE SimpliVity hyperconverged infrastructure solution was designed from the ground up to meet the stringent price-performance, scalability, and resiliency demands of today’s data-intensive applications and highly virtualized IT environments. The HPE SimpliVity solution revolutionizes the data center, by virtualizing data and assimilating all IT infrastructure and services below the hypervisor into x86 building blocks.

This technical white paper presents three scenarios for implementing Microsoft® SQL Server Availability Groups on HPE SimpliVity hyperconverged infrastructure. We provide specific examples of the HPE SimpliVity configurations that support real-world highly available Microsoft SQL Server deployments in virtualized environments; deployments that meet the reliability, performance, scalability, and data protection requirements of mid-to-large enterprises. We also highlight the unique HPE SimpliVity capabilities that complement and enhance SQL Server.

The document is aimed at database administrators (DBAs) who are responsible for managing SQL Server availability and their counterparts who are responsible for the successful ongoing operation of the infrastructure. This includes individuals who work with VMware®, data center operations, and data protection.

The goal is to demonstrate HPE SimpliVity as the base platform for highly available production customer SQL Server environments. Specifically, to demonstrate the scalability of multiple SQL Server instances on an HPE SimpliVity node configured with test databases in a SQL Server Availability Group.

SQL Server availability overview

Multiple SQL Server availability scenarios were built to simulate real-world conditions in a moderate online transaction processing (OLTP) database workload. The primary goal of SQL Server Availability Groups is to maintain the availability of the data for multi-user application environments to enable high availability and disaster recovery.

The testing covered three primary scenarios across two primary HPE SimpliVity configurations.

  • Local high availability: Two SQL Server virtual machines configured for local high availability in synchronous-commit mode.

  • Disaster recovery: Two SQL Server virtual machines configured for remote disaster recovery in asynchronous-commit mode.

  • High availability and disaster recovery: Three SQL Server virtual machines configured for both local high availability and remote disaster recovery in synchronous-commit mode for the local replicas and asynchronous-commit mode for the remote replicas.


    This document highlights the core benefits of the HPE SimpliVity solution, namely:

  • Hyperconvergence: A single shared resource pool that abstracts applications and VMs from the underlying hardware across not just server, storage, and network but also backup, disaster recovery, WAN optimization, and cloud enablement.

  • HPE Data Virtualization Platform: HPE SimpliVity’s differentiation from other hyperconverged infrastructure is in its HPE Data Virtualization Platform, which enables HPE SimpliVity hyperconverged infrastructure’s single shared resource pool across multiple sites, as well as provides highly efficient data storage and mobility. The HPE Data Virtualization Platform consists of the HPE SimpliVity data architecture and accelerator card, as well as enables accelerated data efficiency, global unified management, and built-in data protection.

  • Global unified management: The VM-centric approach of HPE OmniStack to management eliminates manually intensive, error-prone administrative tasks. System administrators are no longer required to manage LUNs and volumes; instead, they can manage all resources and workloads centrally, using familiar interfaces such as VMware vCenter® and VMware vRealize® Automation™.


    HPE OmniStack overview

    The HPE SimpliVity hyperconverged infrastructure solution transforms the data center by virtualizing data and incorporating all IT infrastructure and services below the hypervisor into highly scalable and efficient x86 building blocks. With 69% cost savings on infrastructure and

    operations vs. traditional environments, HPE SimpliVity hyperconverged infrastructure delivers the best of both worlds—the enterprise-class performance, protection, and resiliency that today’s organizations require, with the cloud economics businesses demand.1


    1 The Total Economic Impact Of HPE SimpliVity Hyperconverged Infrastructure, Forrester Total Economic Impact Report, August 2017


    Designed to work with any hypervisor or industry-standard x86 server platform, the HPE SimpliVity solution provides a single, shared resource pool across the entire IT stack, eliminating point products and inefficient siloed IT architectures. The solution is distinguished from other converged infrastructure solutions by three unique attributes such as accelerated data efficiency, built-in data protection functionality, and global unified management capabilities.

    An individual HPE OmniStack node includes:

  • A compact hardware platform—A 2U industry-standard virtualized x86 platform containing compute, memory, performance-optimized SSDs and capacity-optimized HDDs protected in hardware RAID configurations, and 10GbE network interfaces.

  • A hypervisor such as VMware vSphere®/VMware ESXi™.

  • The HPE OmniStack virtual controller software running on the hypervisor.

  • HPE OmniStack Accelerator Card—A special-purpose PCIe card with an FPGA, flash, and DRAM, protected with supercapacitors; the accelerator card offloads CPU-intensive functions such as data compression, deduplication, and optimization from the x86 processors.


    image


    Figure 1. Legacy comparison


    Organizations large and small are taking advantage of these functionalities for their SQL Server environments, deploying robust, highly available, and scalable infrastructure solutions while achieving radical TCO savings. Availability Groups are designed to have multiple copies of data, one for each replica. HPE OmniStack reduces the data footprint through its native in-line data deduplication and compression, which saves valuable space while maintaining the performance of the platform.

    Database business continuity

    OLTP data is absolutely critical to many businesses and requires absolute protection. Multiple factors need to be considered when designing a SQL Server deployment to achieve that goal. To fully protect your data, you have to have a complete risk management strategy in place.

    SQL Server’s advanced business continuity features can complement the HPE OmniStack data protection capabilities, satisfying stringent enterprise SLA requirements.

    Mean time to recovery

    The primary goal of any business continuity plan is to minimize unplanned downtime and maximize application and data availability. Mean time to recovery (MTTR) is the actual amount of time that a system takes to return to operation after a failure.

    Server virtualization solutions are designed to ensure high availability in the event of physical server failures. Should a server fail, VMs are automatically reinitiated on another physical server. VM failover can take several minutes or longer, disrupting business-critical applications. SQL Server Always On functionality provides application-level resiliency, which enables dramatically faster failover.


    Recovery point and recovery time objectives

    Database risk management strategies usually include recovery point objectives (RPOs) and recovery time objectives (RTOs) for an application stack, which includes the application servers, database servers, and the data itself. The two metrics can be achieved differently depending on the scenario. For instance, managing a physical server component failure will require a more limited scope than dealing with a major fire in the primary data center.

    The RPO is defined as the amount of granularity and data loss potential of the platform backups, and is usually measured in minutes or hours. Strategies for RPO management can include frequent VM-level backups or SQL Server transaction log backups.

    The RTO is defined as the target maximum time window needed to restore the application stack to operational status and is usually measured in minutes or hours as well. Strategies for RTO management can include reducing system complexity, automation, increasing performance of the backup medium, and frequent testing scenarios.

    As the RPO and RTO windows are reduced, the complexity of the solution designed to meet the requirements becomes greater. Sometimes the level of complexity of the risk management solution actually contributes to greater downtime of the environment. If outages occur, this high-level complexity can cause the triage procedures to be longer and require a greater amount of effort for remediation.

    HPE OmniStack is architected to reduce the complexity of the risk management solution, while enabling IT to dramatically improve RPO and RTO conformance. HPE OmniStack supports multiple methods to enable high availability for data centers and branch offices. It also supports VM replication for simple and reliable disaster recovery.

    SQL Server Availability Groups

    SQL Server Always On Availability Groups are the flagship availability feature within the enterprise edition of the SQL Server database platform. Availability Groups are intended to enable more control over the high availability and disaster recovery aspects of enterprise databases.

    Availability Groups enhance the availability features of SQL Server to improve benefits around:

  • Leveraging Windows Server® Failover Clusters (WSFC) to perform failovers—without the shared storage requirement.

  • Combining synchronous and asynchronous data mirroring for architectural flexibility.

  • Logically grouping specified databases into an Availability Group.

  • Allowing secondary replicas to perform read-only activities, such as reporting queries and backups, to reduce the load on the primary replica.

  • Providing a straightforward solution for multi-subnet failover and failback processes.


    An Availability Group can logically group together application-dependent databases into one logical unit that fail over together in mere seconds, ensuring that the outage is kept to a minimum. An Availability Group includes one primary replica node and up to eight secondary replicas. An Availability Group fails over to another replica, which can be at either the primary data center or in a secondary data center.

    Each Availability Group replica is configured for either synchronous- or asynchronous-commit mode. Synchronous-commit mode forces

    SQL Server to wait for the all replicas to commit a transaction to disk before acknowledging that the command has completed successfully to the application. Asynchronous-commit mode does not wait for the acknowledgement of a successful commit from the secondary replicas before returning an acknowledgement to the application.

    Each Availability Group can be configured with a listener, which the application can connect to. The Availability Group listener’s DNS entry stays the same, and no application connection string changes are required for the application to reconnect after failover. A failover to a secondary subnet at a disaster recovery site will not cause a change to the connection string, as the architecture can accommodate

    multi-subnet failovers by updating the listener DNS entry accordingly.

    For more information about Availability Groups, read more at msdn.microsoft.com/en-us/library/hh510230.aspx.

    HPE OmniStack in SQL Server environments

    The HPE OmniStack architecture provides a redundant and high-performing scale-out virtual infrastructure capable of running your business-critical SQL Server workloads.

    Central to the HPE OmniStack architecture is high availability. If a host’s hardware were to fail, the VMs that were on the downed host get restarted on the remaining HPE SimpliVity nodes, minimizing the outage window. This level of protection is adequate for many business applications. Some business-critical workloads such as OLTP may impose more stringent SLAs. By leveraging SQL Server Availability Groups


    on HPE OmniStack you can enable application layer resiliency for faster failover, with virtually no service disruption. HPE OmniStack’s data backup and recovery capabilities provide additional protection in the event of catastrophic failures.

    HPE OmniStack in SQL Server Availability Groups

    SQL Server Availability Groups complement HPE OmniStack’s data protection features, providing accelerated failover in the event of planned or unplanned outages.

    To improve the availability of all the databases in a single data center, Availability Group replicas can be placed on individual HPE SimpliVity nodes. A single host failure can trigger an Availability Group failover to another protected replica on another HPE SimpliVity node; the outage window is now measured in seconds instead of minutes.

    While the databases are protected at the primary data center, the disaster recovery strategy for data can also be enhanced by extending an Availability Group from the primary site to asynchronous replicas running at a secondary data center. This disaster recovery feature can be used on its own or in combination with highly available replicas at the primary data center.

    The added flexibility of Availability Groups on HPE OmniStack means an organization can leverage individual features of both HPE OmniStack and SQL Server to design a business continuity strategy that best aligns with the organization’s specific business requirements.

    Solution architecture

    Through many detailed and rigorous lab tests, HPE SimpliVity has developed and validated a technical white paper for deploying and managing Microsoft SQL Server on the HPE SimpliVity hyperconverged infrastructure. This section provides the high-level details of the white paper. The subsequent section describes the detailed testing methodology and results.

    Solution topology

    The validated technical white paper includes HPE SimpliVity 380 nodes, VMware vSphere, Windows Server operating systems, and SQL Server Enterprise edition.

    HPE OmniStack provides tremendous value in the resiliency and availability built into the core architecture. If a high availability event were to occur, the VMs residing on the host would be restarted one or more of the remaining hosts in the HPE SimpliVity federation. On average, the outage duration experienced by a given SQL Server VM is approximately two minutes and 45 seconds while the VM restarts and

    SQL Server completes the crash recovery cycle.

    However, for some organizations, the availability of business-critical data for key systems might require a shorter outage window for both planned and unplanned outages. SQL Server’s Availability Group features can be leveraged to accelerate failovers.

    Three different availability architectures are available with Availability Groups: high availability, disaster recovery, as well as high availability and disaster recovery.

    Availability Groups for high availability

    SQL Server Availability Groups can accelerate failovers and minimize service disruptions for mission-critical applications.

    Two or more instances can be configured in an Availability Group at the primary data center. (Data center 1) Eight secondary replicas can be configured totally in an Availability Group and two of these replicas can be configured in a synchronous manner. For the best possible availability and least risk of data loss, a minimum of two Availability Group replicas can be configured with synchronous data replication within the primary data center. Automatic failover can be configured so that if one replica experiences a problem, the Availability Group fails over automatically and reduces the outage impact.

    Figure 2 shows the basic logical components and connectivity used for the hosts, storage, and network for this Availability Group architecture for high availability.



    image


    Figure 2. Highly available Availability Group on HPE SimpliVity architecture


    If a secondary HPE SimpliVity federation is used for disaster recovery purposes, these Availability Group VMs at the primary data center are actively replicated to the secondary data center to provide a recovery solution for the database platform if the primary site is disrupted.

    Availability Groups for disaster recovery

    SQL Server Availability Groups can minimize service disruptions at the database level in the event of disasters or outages. In the event of an outage or disaster, the secondary Availability Group replica at the secondary data center can be promoted to primary and active in a matter of seconds. This rapid promotion of the Availability Group replica enables customers to meet stringent RTOs that might not be supported by promotion of the replica at the VM-level.

    Figure 3 shows the basic logical components and connectivity used for the hosts, storage, and network for this Availability Group architecture for disaster recovery.



    image


    Figure 3. Availability Group for disaster recovery on HPE SimpliVity 380 architecture


    Availability Groups for both high availability and disaster recovery

    SQL Server Availability Group architectures can be combined to provide both SQL Server high availability and disaster recovery management. Multiple highly available replicas can be managed at the primary data center, while secondary replicas can be promoted to active and made available within seconds at the secondary data center, should the need arise.

    Figure 4 shows the basic logical components and connectivity used for the hosts, storage, and network for this Availability Group architecture for both high availability and disaster recovery.



    image


    Figure 4. Availability Group for high availability and disaster recovery on HPE SimpliVity 380 architecture


    Solution configuration

    HPE OmniStack is very flexible in its deployment configurations. The following architecture represents just one of the many possible deployment scenarios.

    HPE SimpliVity configuration

    Figure 5 depicts a common HPE SimpliVity 380 implementation for a single-site environment. Four HPE SimpliVity 380 nodes are deployed in two data centers as part of the same federation. This 2+2 configuration provides high availability and local backup recovery in each site, as well as remote disaster recovery. All HPE SimpliVity 380 nodes and their associated VMs are managed from a single pane-of-glass using the familiar VMware vCenter Server® management web client. The HPE SimpliVity validation testing demonstrated this configuration at a single-site SQL Server environment.

    Network

    Four HPE SimpliVity 380 systems were configured with both 10GbE and 1GbE interfaces (Figure 5). Redundant cabling was used to protect against interface or facility failures. In a production environment, the best practice is to deploy multiple network switches in a redundant configuration to protect against the loss of a single network switch.



    image


    Figure 5. HPE SimpliVity 380 2+2 redundant network configuration


    Virtual machines

    The test virtual machines in the environment consist of a domain controller, three SQL Server VMs for use for both synchronous and asynchronous AG testing, a file share witness target for the WSFC component, and a test virtual machine to initiate the HammerDB workload to simulate OLTP traffic for the failover testing.


    Table 1. Tested Virtual Machine and purpose

    Server name

    Purpose

    AGDC

    Domain controller

    AGFSW

    File share witness for WSFC quorum functionality

    AGSQL01

    Primary SQL Server Availability Group replica (high-availability configuration)

    AGSQL02

    Secondary SQL Server Availability Group replica (high-availability configuration)

    AGSQL03

    Secondary SQL Server Availability Group replica (disaster recovery configuration)

    Jumpbox

    Utility virtual machine used to initiate the HammerDB load testing


    All the VMs were statically assigned IP addresses. Both the network interfaces on the WSFC and Availability Group traffic networks were assigned IP addresses on a different VLAN and subnet.


    Table 2. Test configuration IP address allocation

    Server

    Management

    Cluster

    Availability Group replication

    AGDC

    10.11.12.1

    N/A

    N/A

    AGFSW

    10.11.12.2

    N/A

    N/A

    AGSQL01

    10.11.12.11

    10.11.13.11

    10.11.14.11

    AGSQL02

    10.11.12.12

    10.11.13.12

    10.11.14.12

    AGSQL03

    10.11.12.13

    10.11.13.13

    10.11.14.13

    Jumpbox

    10.11.12.3

    N/A

    N/A


    The cluster, “AGSQLCL01,” was configured to use the virtual IP address 10.11.12.20 and comprised three VMs—AGSQL01, AGSQL02, and AGSQL03.


    image


    SQL Server application configuration

    All SQL Server virtual machines were configured with a default instance of Microsoft SQL Server. The instance-level configuration details are as follows:

  • SQL Server 2014 Enterprise Edition, Service Pack 1 Cumulative Update 2

  • Minimum and maximum SQL Server memory set to 75% of VM memory allocation

  • Optimize for ad-hoc workloads enabled

  • TPC-C database set to autogrow data and log files by 128 MB

  • Instant file initialization enabled

  • MaxDOP set to the number of vCPUs in a vNUMA node

  • Cost threshold for parallelism set to 25

  • Number of TempDB data files equal to the number of vCPUs (maximum of eight)


    SQL Server Availability Group configuration

    An Availability Group called AGSQLAG01 was created, comprising all three SQL Server instances. This Availability Group contains the HammerDB test database, named “TPC-C.” It is configured for a TCP listener called AGSQLLS01 (DNS name), on the IP address 10.11.12.21 on port 1433. This listener directs traffic to the primary AG node.


    image


    The Availability Group is configured to leverage synchronous data replication between AGSQL01 and AGSQL02, to be used for high availability on one active data center. AGSQL03 is configured for asynchronous data replication to simulate the replication of data to a disaster recovery facility.


    SQL Server licensing

    To configure Availability Groups, SQL Server Enterprise Edition is required. Two licensing models are available for enterprise edition: per-VM or per-host CPU core licensing. Consult your Microsoft licensing specialist for support in ensuring that your implementation is properly licensed.

    Workload specifications

    For each SQL Server availability testing scenario, HammerDB 2.18 was used to simulate a normal OLTP workload stream against the SQL Server Availability Group listener.

    HammerDB is an open source database performance testing utility that executes unofficial TPC-C and TPC-H benchmarks against multiple database platforms, including SQL Server, Oracle, MySQL, and PostgreSQL. It is one of the standard measurements for OLTP database performance.

    It generates a sample data set into a database and then executes a workload against the database. The output is a simple number—transactions per minute—that can be used to demonstrate the relative performance of a database server. The metrics are not especially useful when stand-alone, but when compared to other servers’ results or previous performance tests on the same server, the relative performance differences can be calculated. Any combination of VM, SQL Server engine, workload database size, and testing intensities can be used to determine the appropriate configuration for a given database workload.

    The usage of HammerDB in the failover testing is to simply drive an application workload against the SQL Server Availability Group database. Similar tools that produce other synthetic or real-world workload replay testing can be used against the Availability Group listener or a stand-alone SQL Server instance.

    For the purpose of the failover testing, HammerDB was configured on a Jumpbox VM to load test the database servers. The application was connected into the Availability Group listener IP address and connected to the TPC-C database. The TPC-C database was restored to its original state between test cycles. The restoration process involved removing the database from the Availability Group, restoring the database from SQL Server-native backup onto the primary replica, re-introducing it into the Availability Group, and then re-synchronizing the secondary replicas in the Availability Group.

    Design considerations

    Availability Groups are the flagship availability feature within SQL Server. As each layer in the infrastructure stack has its own preferred methodology for availability and recovery, the following considerations should be reviewed when architecting Availability Groups on HPE SimpliVity hyperconverged infrastructure.

    VM placement

    SQL Server Availability Groups are designed to help an organization minimize risk by reducing the failure footprint if a single node fails. If a physical virtualization host fails, all the VMs running on that host also go down. VMware HA will restart the VMs onto the remaining host, but those downed VMs are unavailable for a period of time while they reboot.

    If all of the SQL Server Availability Group replica VMs are on the same physical host, and the host were to fail, the entire Availability Group would fail and the business would suffer an outage.

    VMware DRS “anti-affinity” rules can be leveraged to minimize this impact of host failure. A simple anti-affinity rule for the nodes in the Availability Group would ensure that each VM is placed on separate hosts. This configuration would minimize the impact on the data layer if one host fails. The Availability Group would failover to a secondary replica if the primary node were to suffer the failure, and the application would only experience a very short outage while the failover occurs. If a secondary replica were to suffer the failure, no failover would occur, with no outage experienced by the application.

    Configuration changes to vSphere

    Availability Groups leverage certain components of WSFC. To ensure that the inter-node communication occurs with the least possibility of interruption, a new non-routable VLAN and VMware virtual network port group should be used for cluster communication. This isolation ensures that quality of service can be provided for this network to reduce the impact of background network activity.

    In addition to the cluster communication network, a second private network for isolated Availability Group communication should be added.

    Configuration changes to guest

    Two additional network adapters should be added to all SQL Server AG VMs and assigned to the cluster communication and Availability Group traffic networks, respectively. Both network adapters should be of type VMXNET3. The WSFC should be instructed to use this private cluster communication network for its cluster communication route.


    Configuration changes to application

    Virtualization-layer backups and VM host migrations can sometimes cause a very short drop in service or network connectivity. This interruption is not long enough to hinder application performance, but WSFC is extremely sensitive to these sort of interruptions and can trigger a cluster failover. To prevent unnecessary failovers due to normal VM-layer operations, the health monitoring thresholds should be made slightly less sensitive.

    Four settings apply to this health monitoring and the defaults are laid out in the following table (derived from blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx).


    Table 3. Health monitoring thresholds

    Parameter

    Fast failover (Default)

    Relaxed (Recommended)

    Maximum

    SameSubnetDelay

    1 second

    2 second

    2 seconds

    SameSubnetThreshold

    5 heartbeats

    20 heartbeats

    120 heartbeats

    CrossSubnetDelay

    1 second

    2 second

    4 seconds

    CrossSubnetThreshold

    5 heartbeats

    20 heartbeats

    120 heartbeats


    To view the current heartbeat configuration values, issue the following command in an elevated PowerShell prompt:

    PS C:\Windows\system32> get-cluster | fl *subnet*


    image


    To adjust each of the parameters to the recommended relaxed values, issue the following commands.

    PS C:\Windows\system32> (get-cluster).SameSubnetThreshold = 20


    image


    Perform this for each of the four recommended relaxed values.

    (get-cluster).SameSubnetDelay = 2000

    (get-cluster).SameSubnetThreshold = 20

    (get-cluster).CrossSubnetDelay = 2000

    (get-cluster).CrossSubnetThreshold = 20


    In addition to these adjustments, increase the cluster log heartbeat traffic parameter to ensure sufficient logging is in place.

    (get-cluster).RouteHistoryLength = 20

    VM backup recommendations

    The impact of HPE SimpliVity application-consistent backups on the functionality and performance of SQL Server Availability Groups was evaluated by executing scheduled HPE SimpliVity backups while running a moderate HammerDB workload against SQL Server.

    Because of the nature of the data replication inside an active Availability Group configuration, active backups of each SQL Server VM must be handled with care. The nature of the Availability Group database sensitivity to the synchronization between the replicas creates a situation where the databases can be out of sync between the replicas during VM-level backup operations. If the VM-level backups are even slightly out of sync during the backup process, the result can be that the databases inside the Availability Group will not restore successfully and retain the replication properties. The databases would most likely not be available in the Availability Group on the secondary replicas.

    As a result, VM-level backups should be used to back up the Windows® operating system and the SQL Server installation but not be depended on to back up the specific databases that are in Availability Groups. SQL Server-level database backup strategies should be implemented to complement the VM-level backup strategy.

    Ordinarily, the result is that the most recent database backup(s) is restored onto the primary replica instance, re-added into the Availability Group, and then the remaining replicas are re-synced with the primary.

    Testing methodology

    A series of tests were designed and executed to validate real-world customer scenarios, along with the resiliency and performance of the HPE OmniStack for SQL Server.

    Three different availability architectures were tested:

    1. High availability

    2. Disaster recovery

    3. High availability and disaster recovery


    The goal of the exercise was to validate the functionality of HPE SimpliVity 380 as the base platform for simulated production customer SQL Server environments with Availability Groups in use. Three specific goals were established to validate the core functionality and value of the HPE SimpliVity hyperconverged infrastructure for Microsoft SQL Server:

    1. Demonstrate the effectiveness of SQL Server Availability Groups in a highly available configuration (synchronous mode) on an HPE SimpliVity federation.

    2. Demonstrate the effectiveness of SQL Server Availability Groups in a highly redundant configuration (asynchronous mode) on a distributed HPE SimpliVity federation between sites.

    3. Demonstrate the effectiveness of SQL Server Availability Groups for both high availability and disaster recovery with synchronous mode replication in one Federation with additional replicas in asynchronous mode at a second HPE SimpliVity federation in another data center for disaster recovery purposes.


    The performance project consisted of operational configuration and failover testing, conducted to demonstrate the application workload management and availability of HPE SimpliVity 380 using typical SQL Server Availability Groups and traditional OLTP workloads.

    Selected benchmarks or Tests

    The architectural testing covered three primary scenarios across an HPE SimpliVity federation.

    1. The loss of an HPE SimpliVity 380 containing the primary replica of a two-replica Availability Group configuration in synchronous mode.

    2. The loss of the primary data center with the primary replica of a two-replica Availability Group configuration in asynchronous mode.

    3. The loss of one or both replicas in the primary data center in a three-replica Availability Group configuration across two sites. The primary data center contains two Availability Group replicas in synchronous mode and it is stretched to the third replica in an asynchronous configuration.


    Significance of benchmark/Test thresholds

    The significance of the test cycles is to demonstrate the improved speed of replica failover of the Availability Group, as opposed to the failover time of VMware HA. On average, VMware HA can restart a VM from a downed host and results in SQL Server being a backup and operational in an average time of two minutes for 45 seconds. The goal of leveraging the Availability Group failover feature is to demonstrate that it can failover in less than 30 seconds.

    Test results

    Each test used generally available (GA) hardware and software. Solution configurations used for the testing are described in Figure 5. The software used is detailed here:

    This testing used standard Microsoft application configurations, best practices, and the HammerDB benchmarking tool to simulate common Microsoft SQL Server OLTP workloads. HammerDB is described in Workload specifications on page 12. HammerDB was used to construct the TPC-C type testing database, which was 50 GB in size.

    Configurations to support the reference architecture

    The testing configuration began with four HPE SimpliVity 380 nodes. This deployment configuration allows for high availability and data protection locally in each data center, as well as remote backups and disaster recovery for a simulated second site.

    Methodology

    The testing methodology is similar to the approach a database administrator would follow when architecting, configuring, and deploying a new SQL Server environment for production purposes.

    1. Review the workload performance characteristics of an existing application stack and dependent databases, if applicable.

    2. Generate service-level agreements to determine the three critical application availability metrics: MTTR, RPO, and RTO.

    3. Deploy hardware and virtual machines, sized and scaled according to the performance metrics previously gathered, or estimations if the databases are to support a new application.

    4. Configure the SQL Server Availability Group across the necessary HPE SimpliVity 380 systems and add a new database for the workload failover testing.

    5. Configure and execute a HammerDB TPC-C database load test to validate the infrastructure performance of the SQL Server as data is moved between replicas.


    Three availability scenarios were tested on HPE SimpliVity 380.

    1. Test operational failover features of SQL Server AGs in synchronous mode during a HammerDB test cycle (high availability).

    2. Test operational failover features of SQL Server AGs in asynchronous mode during a HammerDB test cycle (disaster recovery).

    3. Test operational failover features of SQL Server AGs in combinations of synchronous and asynchronous mode during a HammerDB test cycle (high availability and disaster recovery).


    The tests also highlighted the best practices for HPE SimpliVity 380, specific to Microsoft SQL Server AGs, which are detailed on page 27.

    SQL Server availability testing configuration

    All SQL Server virtual machines were configured with a default instance of Microsoft SQL Server. The instance-level configuration details are as follows:

  • SQL Server 2014 Service Pack 1 Cumulative Update 2

  • VM configuration: 4 vCPU/32 GB RAM

  • Minimum and maximum SQL Server memory set to 75% of VM memory allocation

  • Optimize for ad-hoc workloads enabled

  • TPC-C database set to autogrow data and log files by 128 MB

  • Instant file initialization enabled

  • MaxDOP set to the number of vCPUs in a vNUMA node (4)

  • Cost threshold for parallelism set to 25

  • Number of TempDB data files equal to the number of vCPUs (four)


The HammerDB TPC-C database was restored from the original backup between each failover test execution.


Failover testing

The goal of the tests was to validate that the Availability Group on HPE SimpliVity architecture can provide the required availability to the HammerDB application workload.

During each failover, as expected, the HammerDB test cycle is interrupted. To reconnect it to the Availability Group database, the test cycle was restarted upon failure.

High availability test results

Two Availability Group replicas (AGSQL01 and AGSQL02) were configured for synchronous data replication. AGSQL03 was not configured for data replication.


image


The TPC-C database was added into the Availability Group, and the HammerDB application was connected to the Availability Group listener AGSQLLS01 and the test cycle started.


image


The Availability Group primary replica was then manually failed over from AGSQL01 to AGSQL02. The failover occurred in five seconds, and the HammerDB application test cycle was halted due to the brief connection failover (as expected).

HammerDB was then reconnected and the test cycle restarted.


image


image


image


Next, the AGSQL02 VM was forcefully reset, simulating a complete loss of the VM and/or host. The Availability Group automatically failed back to AGSQL01 in eight seconds, and the HammerDB application was reconnected and the test cycle resumed.


image


Disaster recovery test results

Two Availability Group replicas (AGSQL01 and AGSQL03) were configured for asynchronous data replication. AGSQL02 was not configured for data replication.


image


The TPC-C database was added into the Availability Group, and the HammerDB application was connected to the Availability Group listener AGSQLLS01, and the test cycle started.


image


The Availability Group primary replica was then manually failed over from AGSQL01 to AGSQL03. Because of the nature of the asynchronous replication and potential for data loss, the process was manual. The failover occurred in nine seconds, and the HammerDB application was reconnected and the test cycle resumed.


image



image



image


image


image



Next, the AGSQL03 VM was forcefully reset, simulating a complete loss of the VM and/or host. The Availability Group was manually failed back to AGSQL01 with a prepared T-SQL script in 14 seconds, and the HammerDB application was reconnected and the test cycle resumed.



image


High availability and disaster recovery test results

Two Availability Group replicas (AGSQL01 and AGSQL02) were configured for synchronous data replication. AGSQL03 was configured for asynchronous data replication to simulate the disaster recovery data replica.


image


The TPC-C database was added into the Availability Group, and the HammerDB application was connected to the Availability Group listener AGSQLLS01 and the test cycle started.

The Availability Group primary replica was then manually failed over from AGSQL01 to AGSQL02. The failover occurred in six seconds and the HammerDB application was reconnected while the test cycle resumed.



image


image



Next, the AGSQL02 VM was forcefully reset, simulating a complete loss of the VM and/or host. The Availability Group automatically failed back to AGSQL01 in nine seconds, and the HammerDB application was reconnected and the test cycle resumed.


image



image



Next, both AGSQL01 and AGSQL02 VMs were forcefully powered off, simulating a complete loss of the primary data center.


image


Because failover to an asynchronous replica is a manual operation, the Availability Group on AGSQL03 was promoted to primary and the databases brought online. The complete process took 16 seconds.


image



image



image


image


image


Conclusions

These availability validation tests were performed to demonstrate that the HPE SimpliVity is capable of providing an easy-to-manage platform for supporting SQL Server workloads that leverage Availability Group technologies. The failover test results are highlighted in the following table.


Table 4. Failover test results



Test description

Failover time

Pass/Fail

Sync node failover from node 1 to 2 (planned)

5 sec

Pass

Sync node failover from node 2 to 1 (unplanned)

8 sec

Pass

Async node failover from node 1 to 3 (planned)

9 sec

Pass

Async node failover from node 3 to 1 (unplanned)

14 sec

Pass

Sync + Async node failover from 1 to 2 with 3 (planned)

6 sec

Pass

Sync + Async node failover from 2 to 1 with 3 (unplanned)

9 sec

Pass

Sync + Async node failover to 3 with loss of 1 & 2 (unplanned)

16 sec

Pass


The resiliency and flexibility of HPE OmniStack is complemented by the power of both synchronously and asynchronously replicated Availability Group databases to provide a combination of availability and performance to improve your organization’s business continuity strategy while simplifying its operational complexity. The data reduction technologies native to the HPE SimpliVity 380 reduce the multiple copies of data footprint on disk while maintaining the performance that businesses demand.

Best practices

Configuration and deployment best practices were incorporated into the build and testing methodology. This section outlines those best practices and highlights the architectural choices that were made.

LAN and WAN infrastructure configuration

Each layer of the infrastructure surrounding the HPE SimpliVity 380 environment can contribute to performance inefficiencies, which can limit the performance and availability of SQL Server Availability Groups when located on HPE SimpliVity 380.

Network bandwidth and latency

Performance of the network between each HPE SimpliVity 380 and each site in the federation can be the limiting factor in application and backup performance. Bandwidth and latency determine how much data can be transferred between the application and the database servers, as well as data replicated per backup cycle.

Availability Groups depend on high throughput between the replicas for synchronous data movement. Performance problems or limited bandwidth will cause this traffic to operate slower than expected, which can impact the performance of the SQL Servers. If the network latency causes the round trip to the secondary synchronous replicas to be delayed, the write operation takes longer, and the overall application performance of the Availability Group will suffer.

Additionally, if the network connection between the highly available site and the disaster recovery site cannot keep up with the asynchronous stream of data from the synchronous replicas, this backlog of unreplicated data can cause the recovery point of the data to grow. If this backlog grows too large, the queue could exceed the organization’s RPO.

Network segmentation

To ensure that both additional layers of network-dependent services of Availability Groups, WSFC and Availability Group listener components, have the proper bandwidth, consider adding two additional non-routable VLANs into the environment for WSFC heartbeat traffic and inter-AG data movement. No background network traffic can intrude into these vital communication channels and steal bandwidth from the processes.

The second network adapter should be configured to use the new WSFC heartbeat virtual network, as defined by the network and VM administrators.

The third network adapter should be configured to use the new Availability Group traffic virtual network, also as defined by the network and VM administrators.

If necessary, additional Quality of Service (QoS) measures can be introduced to better insulate these networks from outside network congestion and to provide priority to these critical channels.


image



image



These new VLANs should be mapped to new VMware network port groups so that they can be mapped properly to secondary virtual network adapters on each AG VM.

Affinity rules

The goal of any highly available architecture is to minimize the number of single points of failure. Availability Groups are the premier feature in SQL Server to help improve the availability, but if the primary and secondary nodes are on the same physical HPE SimpliVity 380 and that server fails unexpectedly, the database is down until the VMs are restarted on a different HPE SimpliVity 380. Leverage VMware Dynamic Resource Scheduler (DRS) anti-affinity rules to enforce the business rule that no two Availability Group replicas should reside on the same HPE SimpliVity 380. If a host fails, the remaining replicas are online and ready to accommodate the failover.


image


SQL Server backups on HPE OmniStack

HPE OmniStack provides the native capability to back up the virtual machines. Because of the architectural constraints with synchronizing Availability Group replica logs for backups, care must be taken during HPE SimpliVity backups. A SQL Server-level database backup strategy is recommended in addition to a VM-level strategy.

The underlying challenges lies in the synchronized nature of the database replication transactional logs. If a replica is not backed up at precisely the same time as the other replicas, the backups are out of sync. If these VMs are then restored as a group, the broken log chain will cause the secondary replica databases to not restore successfully.

As a result, leveraging a two-tier backup strategy for both the virtual machines and the SQL Server-level databases is highly recommended. If an HPE SimpliVity VM backup needs to be restored, the database-level backup can be applied as well to return all replicas in an Availability Group to a normal operational state.

WSFC sensitivity to VM-layer activities

Virtualization-layer backups and VM host migrations can sometimes cause a very short drop in service or network connectivity. This interruption is not long enough to hinder application performance, but WSFC is extremely sensitive to these sort of interruptions, and can trigger a cluster failover. To prevent unnecessary failovers due to normal VM-layer operations, the health monitoring thresholds should be made slightly less sensitive.

Four settings apply to this health monitoring, and the defaults are laid out in the following table (derived from blogs.msdn.com/b/clustering/archive/2012/11/21/10370765.aspx).



Table 5. Health monitoring threshold values

Parameter

Fast failover (Default)

Relaxed (Recommended)

Maximum

SameSubnetDelay

1 second

2 second

2 seconds

SameSubnetThreshold

5 heartbeats

20 heartbeats

120 heartbeats

CrossSubnetDelay

1 second

2 second

4 seconds

CrossSubnetThreshold

5 heartbeats

20 heartbeats

120 heartbeats



To view the current heartbeat configuration values, issue the following command in an elevated PowerShell prompt:

PS C:\Windows\system32> get-cluster | fl *subnet*


image


To adjust each of the parameters to the recommended relaxed values, issue the following commands.

PS C:\Windows\system32> (get-cluster).SameSubnetThreshold = 20


image


Perform this for each of the four recommended relaxed values.

(get-cluster).SameSubnetDelay = 2000

(get-cluster).SameSubnetThreshold = 20

(get-cluster).CrossSubnetDelay = 2000

(get-cluster).CrossSubnetThreshold = 20

In addition to the mentioned adjustments, increase the cluster log heartbeat traffic parameter to ensure sufficient logging is in place.

(get-cluster).RouteHistoryLength = 20

For additional best practices and backup considerations, refer to the Microsoft SQL Server Reference Architecture.


Conclusion

This technical white paper demonstrates the use cases and architectural flexibility with leveraging Microsoft SQL Server Availability Groups on HPE SimpliVity, a market-leading hyperconverged infrastructure solution. The architectural testing, run on HPE SimpliVity-validated reference architectures for SQL Server running on HPE OmniStack, proved that it is the perfect platform on which to run highly available production SQL Server workloads. HPE OmniStack delivers the requisite performance and availability, while enhancing the data protection attributes of SQL Server with the native backup and disaster recovery capabilities.

The availability testing successfully demonstrated the outstanding availability of SQL Server Availability Groups on the HPE OmniStack configured for high availability, disaster recovery, and both high availability and disaster recovery.

Deploying SQL Server Availability Groups on HPE SimpliVity hyperconverged infrastructure allows the architects to achieve an organization’s availability requirements by selecting the right availability features to best suit the specific needs.

HPE SimpliVity 380 has complementary features to SQL Server Availability Groups that will benefit each layer, site, and application of the organization.

References and additional resources

Run SQL Server on your favorite platform

microsoft.com/en-US/sql-server/sql-server-2017?&OCID=AID631226_SEM_1ZninR19

Architecting Microsoft SQL Server on VMware vSphere vmware.com/content/dam/digitalmarketing/vmware/en/pdf/solutions/sql-server-on-vmware-best-practices-guide.pdf

Learn more at

hpe.com/us/en/storage/microsoft.html


image


image image image image


image

Sign up for updates

image


image



image

© Copyright 2018 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.

Microsoft, Windows, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. Oracle is a registered trademark of Oracle and/or its affiliates. VMware, VMware vSphere, VMware ESXi, VMware vRealize Automation, VMware vCenter, and VMware vCenter Server are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions. All other third-party marks are property of their respective owners.

a00053626ENW, August 2018