Monday, August 13, 2012

VMware on NetApp NFS

This information was taken from a NetApp white paper by Bikash Roy Choudhury – TR-3839. I am just pulling out snippets that I thought were very interesting. Not all of it is specific to NetApp, but most of my consulting work is on NetApp.
  1. Deduplication - A VMware datastore has a high level of data duplication. This is because it contains many instances of the same guest operating systems, application software, and so on. NetApp deduplication technology can eliminate this duplicated data, dramatically reducing the amount of storage required by your VMware environment. Space savings range from 50% to 90%, with 70% being typical.
  2. Strongly consider using Cat6 cables rather than Cat5/5e. GbE will work on Cat5 cable, and retransmissions will recover from any errors that occur, but those retransmissions have a noticeably larger impact on IP storage networks than on general-purpose traffic.
  3. NFS NETWORKS AND STORAGE PERFORMANCE - There are three primary measures of storage performance: bandwidth (MBps), throughput in I/O operations per second (IOPS), and latency (ms). Bandwidth and throughput are related: bandwidth in MBps equals IOPS multiplied by the I/O size, where the I/O size is the size of the I/O operation from the host's perspective. IOPS are usually determined by the back-end storage configuration. If the workload is cached, they are determined by the cache response; most often, they are determined by the spindle configuration behind the storage object. In the case of NFS datastores, the storage object is the file system. On a NetApp storage system, the IOPS achieved are primarily determined by the number of disk drives in an aggregate. You should also know that each NFS datastore mounted by ESX (including vSphere) uses a single TCP session, carrying both NFS control information and NFS data. Because of this, the upper limit on throughput from a single ESX host to a single datastore, regardless of link aggregation, is one link. If you use 1GbE, a reasonable expectation is a unidirectional workload of ~80–100MB/sec (GbE is full duplex, so a mixed workload can reach ~160MB/sec bidirectionally). Higher total throughput on an ESX server can be achieved by leveraging multiple datastores and scaling out across them with link aggregation and routing mechanisms.
    The performance described above is sufficient for many use cases. Based on these numbers, NFS is appropriate for various VMware scenarios:
    • A shared datastore supporting many VMs with an aggregate throughput requirement within the guidelines above (a large number of IOPS, but generally not large-block I/O)
    • A single busy VM as long as its I/O load can be served by a single GbE link
    With small-block I/O (8K), a GbE connection delivers about 12,500 IOPS, roughly the performance of 70 15K-RPM spindles (a short worked calculation appears after this list). A SharePoint VM, on the other hand, tends to use I/O sizes of 256K or larger; at 256K, the same link yields just ~390 IOPS, which would likely be a problem. Under such circumstances, 10GbE may be the best performance option. With 10GbE, a single TCP session is still used per datastore, but much more throughput is available for demanding workloads. If 10GbE isn't an option, you can use NFS for some VMs over 1GbE and FC or iSCSI for others, depending on the application throughput and latency requirements.
  4. Partition Alignment - For NFS, there is no VMFS layer involved, so only the guest VM file system within the VMDK needs to be aligned to the NetApp storage array. The correct alignment for a new VM can be set either by using diskpart to format the partition with the correct offset or by using fdisk from the ESX service console. In practice, to avoid creating more misaligned VMs, build properly aligned templates so that new virtual machines created from those templates are properly aligned as well. NetApp provides a tool, mbralign, to check and correct the alignment of existing virtual machines.
  5. THIN PROVISIONING - The ability to take full advantage of thin provisioning is a major benefit of using NetApp NFS. VMware thin provisions the VMDK files on NFS datastores by default, but there are two types of thin-provisioned virtual disk files available:
    • “Thick” type thin-provisioned virtual disk. This type of virtual disk file is created by default on NFS datastores during the virtual machine creation process. It has the following properties:
      • Creates a flat .VMDK file; does not occupy actual disk blocks (thin provisioned) until there is a physical write from the guest OS
      • Guaranteed disk space reservation
      • Cannot oversubscribe the disk space on the NFS datastore
    • “Thin” type thin-provisioned virtual disk. You must create this type of virtual disk file using the vmkfstools command. Its properties are:
      • Creates a flat .VMDK file; does not occupy actual disk blocks (thin provisioned) until there is a physical write from the guest OS
      • No guaranteed disk space reservation
      • Can oversubscribe the disk space on the NFS datastore
    You can run the following command against an NFS datastore to show its actual disk space utilization:
    • # vdf -h /vmfs/volumes/<NFS Datastore Name>
    Using the “thin” type of virtual disk, you have the option to oversubscribe the storage capacity of a datastore, allocating more space than the datastore actually contains on the assumption that the VMs will not all use the capacity allocated to them. This can be a very attractive option; however, it has a few limitations that you must understand before you implement it.
    Should an oversubscribed datastore encounter an out-of-space condition, all of the running VMs will become unavailable. The VMs simply “pause” waiting for space, but applications running inside of VMs may fail if the out-of-space condition isn’t addressed in a short period of time. For example, Oracle databases will remain active for 180 seconds; after that time has elapsed the database will fail.
  6. High Availability and Disaster Recovery - NetApp recommends the following ESX failover timeout settings for NFS; increase the default values to avoid VMs being disconnected during a failover event. NetApp VSC can configure these settings automatically. The settings NetApp recommends (across all ESX hosts) are:
    • NFS.HeartbeatFrequency (NFS.HeartbeatDelta in vSphere) = 12
    • NFS.HeartbeatTimeout = 5 (default)
    • NFS.HeartbeatMaxFailures = 10. When the number of NFS datastores is increased, we also recommend increasing the TCP/IP heap values: Net.TcpipHeapSize to 30 and Net.TcpipHeapMax to 120.
    Every “HeartbeatFrequency” (12 seconds) the ESX server checks that the NFS datastore is reachable. Each heartbeat expires after “HeartbeatTimeout” (5 seconds), after which another heartbeat is sent. If “HeartbeatMaxFailures” (10) heartbeats in a row fail, the datastore is marked as unavailable and the VMs crash.
    This means that the NFS datastore can be unreachable for up to 125 seconds (12 × 10 + 5) before being marked unavailable, which covers the large majority of NetApp failover events; a sketch of this calculation appears after the list.
    During this time period, the guest sees a non-responsive SCSI disk on the vSCSI adapter. The disk timeout controls how long the guest OS will wait while the disk is non-responsive. Use the following procedure to set the disk timeout for Windows guests to 190 seconds, so that the guest rides through the failover window described above (a scripted equivalent appears after the list):
    1. Back up your Windows registry.
    2. Select Start>Run, type regedit.exe, and click OK.
    3. In the left‐panel hierarchy view, double-click HKEY_LOCAL_MACHINE, then System, then CurrentControlSet, then Services, and then Disk.
    4. Select the TimeOutValue and set the data value to 190 (decimal).
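
A quick note on the arithmetic behind item 3: the IOPS figures quoted there follow directly from dividing usable link bandwidth by the I/O size. A minimal sketch in Python, assuming roughly 100MB/sec of usable unidirectional GbE bandwidth (an estimate, not a measured value):

    # Rough IOPS ceiling for a single NFS datastore over one GbE link.
    # Assumes the link, not the spindles, is the limiting factor.
    def iops_ceiling(link_mb_per_sec, io_size_kb):
        # MB is treated as decimal (1 MB = 1000 KB), which is how the
        # white paper's 12,500 and ~390 IOPS figures work out.
        return link_mb_per_sec * 1000.0 / io_size_kb

    print(iops_ceiling(100, 8))    # 12500.0 -> the 12,500 IOPS figure for 8K I/O
    print(iops_ceiling(100, 256))  # ~390.6  -> the ~390 IOPS figure for 256K I/O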
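
Similarly, the 125-second window in item 6 falls straight out of the recommended heartbeat settings. A small sketch of that calculation, assuming the values listed above:

    # Maximum time an NFS datastore can be unreachable before ESX marks it
    # unavailable, given the recommended settings from item 6.
    heartbeat_frequency_sec = 12  # NFS.HeartbeatFrequency (HeartbeatDelta in vSphere)
    heartbeat_timeout_sec = 5     # NFS.HeartbeatTimeout
    heartbeat_max_failures = 10   # NFS.HeartbeatMaxFailures

    max_outage_sec = (heartbeat_frequency_sec * heartbeat_max_failures
                      + heartbeat_timeout_sec)
    print(max_outage_sec)  # 125 seconds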
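
Finally, for the Windows disk-timeout procedure in item 6, here is a scripted equivalent of the regedit steps. This is only a sketch: it assumes Python is installed inside the guest and that the script is run with administrative rights, and it uses only the standard-library winreg module.

    # Set the Windows guest SCSI disk timeout to 190 (decimal) seconds,
    # matching the manual regedit steps above. Run as Administrator.
    import winreg

    key_path = r"SYSTEM\CurrentControlSet\Services\Disk"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "TimeOutValue", 0, winreg.REG_DWORD, 190)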

Thursday, August 2, 2012

Understanding the Risks and Returns of Enterprise Cloud Computing

Excerpt taken from the Akamai whitepaper “Taming the Cloud: Understanding the Risks and Returns of Enterprise Cloud Computing”
 
As one of the hottest and most misused buzzwords in IT today, the term “cloud computing” has been the subject of much hype and much confusion. But the potential is real: with its promise of low-opex, zero-capex, on-demand infrastructure, cloud computing offers highly appealing economic and business benefits, such as accelerated innovation and time-to-market. These benefits have given it traction among small and startup businesses, because it gives them low-cost, easy access to true enterprise-grade technology that could otherwise cost millions to build. For these and other reasons, cloud computing has also drawn the cautious but serious interest of larger enterprises.

Merrill Lynch “conservatively” predicts a $160 billion cloud computing market by 2011, including $95 billion in enterprise business applications. This magnitude makes one thing clear: cloud computing is too important a trend to ignore. Forrester Research agrees: “Cloud computing has all the earmarks of being a potential disruptive innovation that all infrastructure and operations professionals should heed.” Yet while cloud computing holds tremendous potential, regardless of which cloud computing taxonomy one adopts, the cloud is the Internet, and this weak link introduces a number of challenges, particularly within the enterprise market.

Cloud Optimization Services are required to address the enterprise market, and these services must go well beyond Content Delivery Network (CDN) cache-based technologies to remove the cloud-based barriers and accelerate enterprise adoption and realization of cloud computing’s benefits.

How Will the Enterprise Use Cloud Computing?
In addition to using these tiered cloud computing offerings, larger enterprises with existing, mature online channels will leverage a mix of public cloud, private cloud, and origin data center services. The ability to migrate and run components of Web applications across various cloud platforms, based on the business requirements of the application, will be a fundamental tenet of how enterprises will migrate to the cloud. A single site may use IaaS services for storage overflow, PaaS services for custom application modules, and best-of-breed SaaS applications, along with on-premises origin systems. Some enterprises will even establish private clouds, creating a pool of infrastructure resources, typically deployed within their own firewall, that can be dynamically shared among different applications and functions within the enterprise.

Consider that the Internet is the common link between all of these cloud computing modules, introducing its own issues around performance, reliability, scalability, and security. Real-world cloud computing implementations will therefore face both the challenges of integrating multiple cloud offerings and the challenges inherent to the Internet cloud itself.

Enterprise-ready Cloud Computing Requirements
While cloud computing has gained significant traction among startups and small businesses, enterprises require the following for cloud computing to deliver on its promise of creating a far more efficient, flexible, and cost-effective infrastructure for their IT needs:

Performance: As enterprises think about shifting from a LAN-based, on-premises solution to a cloud-based offering, application performance becomes a key consideration. The performance of any centrally hosted Web application, including cloud computing applications, is inextricably tied to the performance of the Internet as a whole, with its thousands of disparate networks and the tens of thousands of connection points between them. Latency, network outages, peering-point congestion, and routing disruptions are among the problems intrinsic to the Internet that make it difficult to rely on for business-critical transactions.

Reliability: The numerous recent, high-profile outages at many of the major cloud computing providers highlight the need for the high-availability solutions enterprises demand, as even small amounts of downtime can cost companies millions in lost revenue and productivity. In addition, wide-scale network problems caused by trans-oceanic cable cuts, power outages, and natural disasters can severely disrupt communications across large regions of the globe.

Security: Companies worry about loss of control and security when moving applications outside their firewall onto virtual infrastructure, where physical machines and locations are unknown. The Internet introduces new security issues including distributed denial-of-service (DDoS) attacks, DNS attacks, and even application-specific risks such as cross-site scripting attacks and SQL injection. Regulatory and legal compliance requirements present further challenges.

Visibility and Control: Cloud offerings need to provide enterprise-grade support, including robust logging, reporting, and monitoring tools that provide visibility into the infrastructure. Moreover, the Internet, with its many moving parts, presents a complex system to troubleshoot when things go wrong.

Ease of Integration: As most clouds are proprietary, they often require new skill sets as well as re-architecting or re-writing existing applications in order to take advantage of cloud benefits. Enterprises want solutions that allow them to leverage their heavy investment in their legacy applications. This challenge is compounded by the modular, multiple-cloud application solution strategies needed by large enterprises.

SLAs: Service-level agreements (SLAs) are rare among cloud computing providers. And while larger providers offer 99.9% uptime SLAs, this simply isn’t good enough for business-critical applications. In addition, these SLAs usually refer to the uptime of the cloud service provider’s own infrastructure, rather than the more relevant measure of availability to end users.
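
To put the 99.9% figure in context, here is a quick back-of-the-envelope calculation (Python) of how much downtime a given uptime SLA permits over a year:

    # Downtime allowed per year at a given uptime SLA.
    HOURS_PER_YEAR = 365 * 24  # 8,760 hours

    for sla in (0.999, 0.9999, 0.99999):
        downtime_hours = HOURS_PER_YEAR * (1 - sla)
        print(f"{sla:.3%} uptime allows {downtime_hours:.2f} hours of downtime per year")

    # 99.900% uptime allows 8.76 hours of downtime per year
    # 99.990% uptime allows 0.88 hours (about 53 minutes)
    # 99.999% uptime allows 0.09 hours (about 5 minutes)
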
There are aspects of each of these cloud computing requirements that can only be addressed by dealing with Internet issues. To illustrate, while some cloud computing vendors (PaaS providers in particular) talk about providing scale and reliability for their offerings, they are typically referring to the “origin” or first-mile infrastructure that they provide, not to the whole cloud. They may provide automated server failover or a virtual database cluster with automated replication, for example. However, these services do nothing about the bottlenecks in the cloud itself that can adversely affect the end-user experience.

This underscores the critical need for underlying Cloud Optimization Services that can tame the cloud — services that will enable cloud computing to reach its true potential.

Wednesday, August 1, 2012

Cloud Computing 101

Excerpt taken from the Compuware whitepaper “Performance in the cloud”

Cloud computing provides on-demand, real-time network access to shared resources that can be physically located anywhere across the globe. From an enterprise point of view, it represents a fundamental shift in the way to acquire and implement new services (computing power, storage, software, etc.) to support the business. Instead of internally developed, monolithic systems, or lengthy and costly implementations of customized third-party business solutions, cloud computing provides an agile and flexible environment with shorter solution implementation cycles and much lower initial cost.

The business benefits of cloud computing include lower IT costs (new hardware and software functionality is acquired on a metered-service basis rather than through capital expenditure) and increased business agility (application enhancements to support changing business needs no longer rely on internally developed or customized third-party software). Also, since cloud applications are inherently web- and Internet-based, enterprises can quickly and easily extend new functionality to customers via their web sites.

As an example, consider a common financial application such as mortgage loan preapproval. As a legacy application, the client-based software contains all the business logic necessary to guide the customer service representative (CSR) through the pre-approval process. Several steps may require manual intervention—accessing property tax records, insurance data and credit history—because the information does not exist inside the enterprise and can change very quickly. The net result is either a delay in pre-approval or an inaccurate estimate of closing and servicing costs to the customer, both of which can result in lost business. Provided that the legacy application uses a modern architecture (.NET for instance), it is relatively straightforward to add various web (cloud) services to the application to access the missing information, resulting in faster (and more accurate) decision making.
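
As a rough illustration of the composite-application idea described above, here is a hypothetical Python sketch. The service endpoints, field names, and decision thresholds are invented for illustration only; they are not from the whitepaper or any real lender.

    # Hypothetical composite application: a legacy pre-approval flow augmented
    # with external (cloud) web services for the data that previously required
    # manual lookups. All URLs and fields are illustrative placeholders.
    import json
    from urllib.request import urlopen

    def fetch_json(url):
        # Call an external web service and parse its JSON response.
        with urlopen(url, timeout=5) as resp:
            return json.load(resp)

    def preapprove(applicant_id, property_id, income, requested_amount):
        # Data that used to require manual steps, now pulled from cloud services.
        credit = fetch_json(f"https://credit.example.com/history/{applicant_id}")
        taxes = fetch_json(f"https://tax.example.com/records/{property_id}")

        # Simplified, made-up stand-in for the legacy business rules.
        loan_to_income = requested_amount / max(income, 1)
        approved = credit["score"] >= 680 and loan_to_income < 4.5
        return {"approved": approved,
                "estimated_closing_costs": round(requested_amount * 0.03 + taxes["annual_tax"], 2)}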

Existing enterprise applications that have been upgraded with cloud-based services are commonly known as “composite applications.” They offer a “best of both worlds” approach to the business: They protect the significant investment in existing software, but they also provide a rapid, relatively inexpensive way to add functionality to meet changing business needs. Additionally, with a browser-based user interface, the application can be offered to customers as a self-service option through the enterprise’s web site—this can have a significant positive impact on the bottom line, since the site is open for business 24x7. In the case of mortgage pre-approvals, for instance, a number of financial institutions now offer web-based applications which can provide on-the-spot decisions, with a complete, accurate listing of closing and servicing costs. The application saves money because no CSR is involved in the pre-approval process, and boosts business because consumers are less likely to shop around if they receive a fast, accurate pre-approval.

Cloud computing actually provides even greater flexibility, and potential cost savings, by allowing enterprises not only to leverage existing applications in the cloud (Software-as-a-Service), but also to deploy applications on a service provider’s servers (Platform-as-a-Service) and to leverage the processing, storage, networks, and other fundamental computing resources owned by service providers (Infrastructure-as-a-Service).